A Graphical View of Bayesian Variable Selection 

Zaili Fang and Inyoung Kim* 
March 8, 2013 

Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, 
Virginia, U.S.A. 

*To whom correspondence should be addressed: 
Inyoung Kim, Ph.D. 

Department of Statistics, Virginia Polytechnic Institute and State University, 410A Hutch- 
eson Hall, Blacksburg, VA 24061-0439, U.S.A. 
Tel: (540) 231-5366 
Fax: (540) 231-3863 
Email: inyoungk@vt.edu 



Abstract 

In recent years, Ising prior with the network information for the "in" or "out" binary 
random variable in Bayesian variable selections has received more and more attentions. 
In this paper, we discover that even without the informative prior a Bayesian variable 
selection problem itself can be considered as a complete graph and described by a Ising 
model with random interactions. There are many advantages of treating variable se- 
lection as a graphical model, such as it is easy to employ the single site updating as 
well as the cluster updating algorithm, suitable for problems with small sample size 
and larger variable number, easy to extend to nonparametric regression models and 
incorporate graphical prior information and so on. In a Bayesian variable selection 
Ising model the interactions are determined by the linear model coefficients, so we 
systematically study the performance of different scale normal mixture priors for the 
model coefficients by adopting the global-local shrinkage strategy. Our results prove 
that the best prior of the model coefficients in terms of variable selection should main- 
tain substantial weight on small shrinkage instead of large shrinkage. We also discuss 
the connection between the tempering algorithms for Ising models and the global-local 
shrinkage approach, showing that the shrinkage parameter plays a tempering role. The 
methods are illustrated with simulated and real data. 

Keywords: Cluster Algorithm; Global-Local Shrinkage; Graphical Model; Ising 
Model; KM Model; Long Tail Prior; Mixture Normals; Tempering Algorithm; Vari- 
able Selection. 
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1 Introduction 



In this paper, we consider the standard multiple linear regression model [y \/3, <p\ ~ N(Xf3, 
where y is n x 1 vector of the response variable, X = (x 1; x p ) is an n x p matrix of predic- 
tors, f3 = (f3i, (3 P ) T is p x 1 model coefficient vector of the full model with (3j, j = 1, ...,p 
corresponding to the jth predictor, and is the precision parameter. The "in" or "out" 
of the predictor is represented by a binary random variable 7j. The Bayesian spike and 
slab approaches to sampling 7j's have been introduced by different authors and maintain 
one of the most active research areas in Bayesian statistics, such as the Stochastic Search 



Variable Selection (SSVS) (George and McCulloch, 1993) and rescaled spike and slab model 



(Ishwaran and Rao, 2005). In recent years, incorporating networked prior information of the 



predictor into those Bayesian variable selection models has received many attentions (Li and 



2010 


Stingo et al. 


2011 


Tai et al. 


2010) 



the network information of the predictors are introduced through an informative prior for 
7j's, which is a binary random graph, but none of them treat the variable selection as a 
graphical model when the prior is noninformative. A binary random graphical model for the 
random vector 7 = (jx, ...,j p ) T is represented by an undirected graph G = (V,E), where V 
represents the set of p vertices or nodes corresponding to p predictors and E is a set of edges 
connecting neighboring nodes. In this paper, based on a reparameterized Bayesian variable 



selection model, KM model (Kuo and Mallick, 1998), we generalize the Bayesian variable 



selection problem into a Bayesian graphical model, referred to Bayesian Variable Selection 
Graphical Model (BVGM), and demonstrate that with the noninformative prior for 7 the 
model is essentially a complete graphical model. 

The Markov chain random process on a random binary graph can be well modeled by a 
Ising model conditional on /3 and <fi. Thus the posterior distribution of jj = 1 or the pos- 
terior distribution of jth predictor "in" the model can be achieved by sampling the random 



binary variable. As one of the most active research areas, abundant theories and sampling 



procedures for Ising model have been reported. A nice review can be found in (Iba, 2001 



Newman and Barkema, 1999). One difficulty to sample 7 in BVGM is that the interactions 



are random since they are expressed by the product of /3/s and 0. Another difficulty is due to 
the long-range interaction of the complete graph where each node is coupled or neighboring 
with all other nodes. In the literature, the well known approaches to handle random and 
long-range interactions are the cluster algorithm and a family of exchange Monte Carlo, par- 



allel tempering and simulated tempering algorithm (Iba, 2001). For the issue of the cluster 



algorithm, Nott and Green (2004) introduced the Swendsen-Wang algorithm (Swendsen and 



Wang, 1987) into Bayesian variable selection and Monni and Li (2010) discussed the Wolff 



algorithm (Wolff, 1989) in the study of network-structured genomics data. However, both 
algorithms are constructed based on the graph prior for 7 and consider fixed interaction 
only. Therefore, both are not applicable to the more general random complete graphical 
model. In this paper, we generalize the cluster algorithm based on Wolff's approach so that 
the cluster is formed even with the random interactions among nodes. Furthermore, our 
generalized Wolff algorithm is introduced for complete graph with noninformative prior for 
7, and it is straightforward to combine the graphical prior information. 

For the issue of tempering algorithm, so far to our best knowledge there are no work dis- 
cussing the connection between the tempering algorithm and Bayesian variable selection. In 
all the Bayesian variable selection models, there is always a critical parameter associated with 
penalization or shrinkage. By showing the variable selection problem as a Ising model, we 
address that the well known shrinkage parameter in Bayesian variable selection is equivalent 
to the temperature parameter in a Ising model. However, in the regular tempering algorithm, 
there is only one global temperature as a random variable. In BVGM, we adopt the global 



shrinkage and local acting strategy (JPolson and Scott, 2011, 2012), which is employed by 
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assigning the priors of scale normal mixtures for /Jj's ( Barndorff-Nielsen et al. 1982 West 



1987). Each j3j has a local shrinkage parameter, its normal precision parameter, as the local 



temperature, and there is another global shrinkage parameter to place a constrain on all 
local parameters. Furthermore, assigning different prior for fy's precision parameter leads 
to different performance. The widely known priors for p(/3j) in this area include Student-t 



(normal/gamma) prior (Tipping, 2001), Laplace (normal /inverse gamma) prior (Carlin and 



Poison, 1991 Hans, 2009; Park and Casellaj 2008), horseshoe prior (Carvalho and Poison 



2010), and Jeffrey's prior (Bae and Mallick, 2004). 



Another issue we concern in this paper is the dynamics of the selection probability under 
different shrinkage. In Bayesian variable selection with large p, instead of the appearance 
frequency of one of the 2 P possible models, usually the predictors are selected according to the 
posterior marginal selection probability, pi^fj = l|y), since the frequency of one specific model 
is extremely small. We define the curves of the selection probabilities of all predictors against 
the shrinkage parameter as the profile curves of BVGM . These profile curves are important 
because they provide a direct view about how to select the shrinkage parameter. They also 
assess the performance of different priors for (3. Unfortunately, we have not seen any work 



study the overall profile of selection probability under a wide range of shrinkage expect Lykou 



and Ntzoufras (2012) where they studied the selection probability against the shrinkage 



with Laplace prior only. Hence one purpose of this paper is to systematically study the 
dynamics of the selection probabilities, and compare different /3j priors with different weight 
on shrinkage. We address this issue by focusing on the orthogonal design. Interestingly, 
instead of priors with much weight on large shrinkage, our results indicates that the best 
performance of a prior is obtained by placing substantial weight on small shrinkage but 
not zero shrinkage. Among those (3j prior candidates, horseshoe prior is the one capable 
to maintain such shrinkage proportion for the widest range of shrinkage parameter, thus 
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considered as the best. 

We also consider one extension of BVGM to Bayesian sparse additive model (BSAM). 
Unlike the popular topic of Bayesian variable selection, there are only few papers discuss 



[ Reich et al. 


2009 


Scheipl 


2011 



Smith and Kohn, 1996). Based on the KM model, our BVGM is very straightforward to 



extended to BSAM. We employ the Lancaster and Salkauskas (LS) spline basis (Chib and 



Greenberg 



2010 



Lancaster and Salkauskas, 1986) to express the nonparametric function 



components. To our best knowledge, our paper is the first one capable to connect the 
graphical model with the nonparametric regressors such that we can select an appropriate 
subset of the function components and estimate the flexible function curves simultaneously. 

We first introduce the KM hierarchical model and full conditional distributions for sam- 
pling the parameters except 7 in Section [2j In Section [3] we discuss the connection between 
Bayesian variable selection and binary random graphical model and express our model as the 
Ising model with noninformative prior for 7. Then in Section [4], we first introduce the single 
site algorithm for sampling 7, then present a generalized Wolff cluster algorithm. In Section 
[5j we focus on understanding the selection probability profile including the dynamics of the 
selection probability under different shrinkage priors, and we discuss the connection between 
the simulated tempering algorithm and priors of the scale mixture of normals. In Section [6] 
and [7] we consider two extensions, one is how to incorporate prior network information for 
7, another one is how to extend to BSAM with LS basis. In Section [8] and [9j we illustrate 
our model with simulations and real data analysis. Finally, in the last section, we conclude 
our work and discuss other potential extensions of our model. 
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2 Bayesian Variable Selection with Normal Mixture 
Priors 



We are interested in selecting a subset of predictors from the p potential candidates. Thus 
we introduce the binary random vector 

7 = (7i,-,7p) T , 

where jj G (0,1), j = l,...,p is the binary indicator random variable corresponding to the 
jth predictor. With jj = 1 we selecte predictor otherwise exclude it from the model. 
To implement the stochastic search for j^s, SSVS considers a multi-mode point mass and 
Gaussian mixture prior for /3/s, [fyl'yj, rp] ~ (1—^)5(0) + 7jiV(0, rJ ), where 5(0) represents 
the point mass density at zero. 

In this paper we consider the KM model, which is expressed as 

v 

3 

where e ~ N(0, <fi~ l I) is independent identical noise vector. We standardize the data set X 
and center the response y such that Ym=i x % = 1> J27=i x v = 0; J = 1; ■■■V an d Y^=i U% = 0- 
We may also include an intercept term p, in model Q with a normal prior, which requires 
only a simple extra step in the sampling procedure. In Section [7j this parametric linear 
regression model is easy to extend to nonparametric additive model by using some basis 
function to express the jth individual function component with /3 • as a parameter vector for 
the jth predictor. 

The reasons we employ KM model in this paper are: first, it is more natural as a variable 
selection model, where jj = indicates that jth predictor has no effect in the response. 
Second, spike and slab models such as SSVS consider a multi-mode prior for {3/s which may 
have a mixing problem for sampling /3j's since flj's may get trapped in the point mass mode 
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for a long time. This problem becomes worse when we extend the SSVS to nonparametric 



additive model (Scheipl, 2011), because the chance of moving between the point mass and 
the normal model for /3 • becomes lower in higher dimensional space. The third reason can 
be demonstrated in next section where we can see that it is very straightforward to express 
a KM model in a Ising model, while it is difficult for SSVS models. 

In usual Bayesian variable selection, the normal prior assigned to (3/s has form [/3|rg] ~ 
iV(0, t7 Ip), where Tg is a common precision parameter for all /3j's and usually assigned an 
gamma prior with scale a/2 and rate 6/2. Similar prior can be [/3|r^] ~ N (0, Tg (X T X)~ l ) 



assuming = g~ l 4>, where g is a positive number called ^-factor (Liang et al. 2008). Be- 
cause of this simplicity, and (ft can be integrated out and a closed form of the posterior 
distribution for 7 can be achieved. However, we realized this simplicity has many disad- 
vantages for variable selection purpose. For example, if we integrate out and achieve the 
marginal prior of (3 as p((3\a, b) oc ((3 T (3 + 6) _E ^~, we can see the prior of /3/s are no longer 
statistically independent to each other. This is not a good idea to explore the whole joint 
distribution space of (7, /3) since the main purpose of Bayesian variable selection is to ex- 
plore the space of 7, while dependent f3 prior will limit the stochastic searching space. Based 



on this argument, we follow the shrink globally act locally scheme suggested by Poison and 



Scott (2011) to assign independent normal mixture priors for /3,'s. In next section, we will 



see that the interactions of the Ising model are determined by fl/s. The larger /3/s of two 
predictors, the larger the interaction between them, then the corresponding nodes have high 
probability to be dependent, meaning they are either "aligned" (both jj equal to 1, or both 
equal to 0), or "anti-aligned" (one jj equals to 1, and another equals to 0). On the other 
hand, if 0j's between two nodes are very small, then the two predictors are independent 
to flip their jj values. Therefore, we want fy's to be as flexible as possible to explore the 
configuration space, while we do not want to lose the control so we constrain the overall 
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variability of the interaction through a global parameter, which we refer to b. 

The shrink globally act locally scheme is easy to be implemented by following hierarchical 
model with scale normal variance mixture priors for /3j's. 

[/3,|r„6]~iV(0,6 2 rr 1 ), 

fal-lto), (2) 

where Tj is the precision parameter for the conditional normal prior of /3j and plays the 
role of local tempering. p(rj) and p(<fi) are the priors for r/s and <fi respectively. Similar 



hierarchical model in SSVS setting also has been discussed by Heaton and Scott (2010). 



With these settings, we can easily achieve the full conditional distribution for f3 c C (3 

[/3c|y>7>/3c^] ~ S ( 3 ) 
{ N(0,D^) if 7c = 0. 

Here we use a general subscript "c" to stands for subset of the index {1, ...,p}. We use c to 
present the complementary index set of c. In above expression, D c is a |c| x |c| diagonal 
matrix with Tj/b 2 ,j G c as the diagonal elements, where |c| stands for the cardinality of c. 
S c and n c are expressed as 

s c = (<pxjx c + D c y\ 

(4) 

Mc = 0S c X c T (y-X 7c _/3 c -)- 
With some notations abuse here, X c stands for the sub-matrix of X corresponding the 
predictors in c, and X^_ is the sub-matrix of X 7 = (71X1, 7 p x p ) corresponding to predictors 
in c. 

We simply assign a noninformative prior for : [0] ~ and the posterior distribution 
of 4> is simply a gamma distribution. 

[0|y,/3,7]~G(^||y-X 7 /3|| 2 ]. (5) 



The prior for r,'s are critical since it determines how the local action of the sampling 
process. Many different type of p(tj) can be considered. A very general review of different 



choice for p(rj) can be found in Poison and Scott (2010, 2011). In this paper we only 
consider three widely known p(rj-)'s that result in three typical marginal /3j's priors with 
characteristics of heavy tail, heavy mass around zero and both. We refer to these marginal 
priors of 0j as Cauchy, Laplace and horseshoe priors which are achieved by assigning gamma 

1 /2 

prior [Tj] ~ (7(1/2, 1/2), inverse gamma prior [Tj] ~ IG{1, 1/2) and half Cauchy prior [t- ] ~ 
C + (0, 1) to Tj respectively. The density forms for these three normal mixture settings are 
list in Table [T] respectively. Notice, to avoid the confusion, the terms of "Cauchy", "Laplace" 
and "horseshoe" not only refer to the marginal priors of /3j's but also represent the normal 
mixture settings. For example, in the context, "Cauchy prior" stands for normal/gamma 
setting such that the marginal prior of /3j is Cauchy and the prior for p(tj) is G(l/2, 1/2). 

Table 1: Summary of Cauchy, Laplace and horseshoe priors for the marginal prior of /3/s, 
corresponding priors for p(rj) and the density functions of the shrinkage parameter Kj. 



Marginal prior p(/3j\b) 



Prior for n 



Distribution for 



Cauchy 
Laplace 
Horseshoe 



Tib{P] + b 



2W 



Tj exp( 



(26)- 1 exp(-|/3,|/6) rr 2 exp(- 



2 - 
— ) 



Kj 2 (1 — KjYz exp( 



u 2 ,- 



2(1-Kj) ' 



2 exp( 



2fe 2 K, - 



Kj 2 (1 - Kj) 2 [1 - Kj + b 2 Kj\ 



By defining a scaleless parameter b* = b/y/(j), The full conditional distribution for Tj in 
Cauchy and Laplace settings are 

(Tj-b*/\(3j\f 



[Tj\pj,b] cc Tj 3/2 exp 
1 



[Tj\/3j, b] oc exp 



Laplace prior 
Cauchy prior 



(6) 
(7) 
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and the full condition distribution for Tj in horseshoe prior setting is obtained by 



[uj\f3j, b,Vj] oc exp 
[vj\/3j,b,Uj] oc vj 1 exp 

Tj = Uj/Vj, 



ft 



2 \b*\ 



1 //3] Uj 1 

2 V b* 2 vj 



2 „. + v i 



Horseshoe prior 



(|6| is an inverse Gaussian distribution ING(b* /\(3j\, 1) with mean b* /\/3j\ and shape param- 
eter 1. ^ is an exponential distribution or Gamma distribution G (l, (/3|/6* 2 + l)/2). The 
Gibbs sampler for horseshoe prior is implemented by using the redundant multiplicative repa- 



rameterization technique similar to Gelman (2006). Reparameterize tj as Tj = Uj/vj where 
Uj and Vj are independently distributed with prior G(l/2, 1/2) respectively, then the prior 

1 /2 

for Tj ~ C + (0, 1), and the prior for [3j is the horseshoe prior. In (8), the full conditional 
distribution for Uj is a Gamma distribution G(l, (0? /(b* 2 Vj) + l)/2) and the full conditional 
distribution for v~ is generalized inverse Gaussian distribution GING I 0, -hr-, 1 ) • 



3 Bayesian Variable Selection and Binary Random 
Graphical Model 

The noninformative prior for 7 is 7 ~ (|) . Thus the full conditional distribution of ~f\(3, 4> 
is directly derived from the likelihood of 7 given (3 and <p. Given f3, consider the matrix 
of marginal regression functions R = (ri, r p ) — (PiXi, ...,/3 p x p ), with each column as the 
marginal regression vector for jth predictor vector. In additive nonparametric model (see 
Section [7]), r, = fjfej) = Zjftj is the nonparametric function component of Xj expanding 
on the n x Mj basis matrix Zj with 1 x Mj coefficient vector /3 • {Mj is the dimension of 
the basis). Here we consider parametric regression model only, thus the full conditional 
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Figure 1: Diagram of variable selection as a random graph model with selected nodes (filled 
circles), excluded nodes (circles), edges of positive interaction (black lines), and edges of 
negative interaction (red lines). Independent variable selection: no interactions among nodes 
(a). General variable selection: a complete graph (b). 

distribution of 7 is 

p(l\y,P,<f>) ocp(y|7,/3,0) 

1 \ (9) 



oc exp ( — -07 R i?7 + 0y R~y j . 
This is nothing more than a Boltzman distribution of Ising model, ~ exp(— U("y)), with 

U( 1 ) = - 1 T J 1 -h T j, 

0^ (10) 
2 

h = <t)R T y, 

where Z = ^ 7 exp(— U(-f)) is called the partition (normalized) function and £7(7) is called 
the energy of state 7 given (3 and <f), J is the interaction matrix and h is called "external 
field" . Above expression of Ising model is equivalent to following model: 



P(l\y, A 0) oc exp ( ^ ■ , :i' S >i + ^i 7 -? ' ( n ) 



. i<3 
12 



where the first summation is on all i < j,j = 1, ...,p, — 1 if 7* = jj otherwise 5ij = 0, 
Jij = 0/3j(xfxj)/3j is the non diagonal element of matrix J and h* is the jth element of vector 
h* = <f)R T (y — Rl/2). Above expression is achieved by plugging in following transformation 
into g 

'G + H)H)]-*-{^i 

In the literature, the model with 7 distributed as ^ is called spin glass model (consider 
7j has two spin states, up and down, corresponding to 1 and respectively) when the 
coupling parameter follows some random distribution with positive or negative values. 
In our Bayesian variable selection model, because is the product of random variable 
flj's and <p each has a prior, the distribution for is some unknown distribution usually 
is neither iid nor tractable. Therefore, numerical method, such as MCMC sampling to 
simulate the distribution of 7 is required. Now we can see that the choice of the prior for 
(3j's is important since it directly effects the interaction among the nodes. The independent 
scale normal mixture prior for (3/s is a nice choice since it is similar to the well known 
tempering algorithm in Ising model, and we can also derive some cluster algorithms. Both 



algorithms are expected to improve the mixing issue of the sampler (Nott and Green, 2004 



Swendsen and Wang, 1987; Wolff, 1989) 



Based on the Ising model, considering the p predictors as a set of nodes, we assign a 
binary random variable jj for each nodes. Those nodes may interact or couple with each 
other as described by a Ising model, so we have the following proposition: 
Proposition 1: The p dimension binary random variable 7 G {0, 1} P of the Bayesian 
variable selection problem based on KM model |7]) is a class of stochastic processes on a 
finite random undirect graph model G = (V,E), where V = {l,...,p} is the set of nodes, 
corresponding to p predictors, and E C V xV is the set of edges. 7 G T = {(71, 7 P ) : 7j G 



(0, 1), j = 1, ...,p} is indexed by V with probability measure on T as (11), in which 's and 
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h* 's are all random with some distributions determined by the priori distributions of (3j 's and 

This is a complete graph model, since we don't limit the connection between any two 
nodes of V and the coupling between two nodes are long-range interaction. Figure [T] is 
the diagram of the graphical model for Bayesian variable selection. In Figure [I] (a), the 
interaction between any nodes Jy = thus this is a complete independent setting with 
which the configuration of 7 depends on the "external field" h only. Figure [I] (b) is a more 
general diagram for the complete graphical model. However, since any possible J is allowed, 
for a given J, a specific configuration of the edges will be given. For example, for a one 
dimension Ising model, the nodes form a one dimension chain, and one node only interacts 
with its two nearest neighbor nodes. This means the matrix J is a sparse matrix with non 
zero elements in positions \i — j\ < 1 only. Furthermore, we can also consider the external 
field h* as a node indexed by which represents the response variable y, except that 70 = 1 



is fixed. Then (11) can be expressed by a more compact form 

P(7|y> A <P) « exp I ^ J ii 6 



i<3 



where i,j = 0, 1, ...,p, and J™ is the extended matrix with the first row and column equal to 



h*. However, in this paper we keep focus on expression (11) for explicitness. 



4 Updating of 7 

4.1 Single Site Algorithm 

The joint posterior distribution of 7, (3 and will directly give this posterior distribution for 
7. Because ^ and (11) are the direct form of Ising model, we can direct apply the Gibbs 



sampler procedure for 7 based on ^ and ( 11 ) after sampling (3 and <p. This means we assign 
a noninformative prior for 7, 7 ~ Q) P ) an< ^ ^ e conditional distribution given the data 
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for single site updating is 

hj\y,l],P,<t>] ~ Ber (t~^t) ' 

7T = exp {-[Ufri = lhj) - U^j = 0| 7J )]} ( 13 ) 

= exp(-Jy - 2J j - j j- j - hj), 
where Ber stands for the Bernoulli distribution and Jji is the jth row of J with jth column 

removed. U(jj = 1|tj) — U(jj = 0\jj) is the "energy" difference of two state configurations, 

7j = 0|Tj an d 7j = where 7^ is the vector of 7 with jj removed. Therefore the complete 

full conditional distributions of Gibbs sampler to update 7, f3 and <p involves expression ([3]), 

Q, (§ and one of ([6j|8]) . This procedure is very simple and works well for most cases with 

moderate size p. The main advantage of our procedure is there is only one tuning parameter 

b, and the "tuning" process is extremely simple: just choose a b that separates the signals 

and noises with the largest gap in the marginal selection probability. 

Above Gibbs sampler to update is one Matropolis-Hastings (MH) step with the Gibbs 

proposal and acceptance rate equal to one. We can also consider a MH one-step updating, 

which is more general in Ising model sampling. Denote the current state for jj as 7°|7j and 

its flipped state 7* whether or not we move from 7°|7j to 7* |7j depends on the "energy" 

difference AU = [/(t||7j) — {7(7° |7j). We prefer the system in lower "energy" state since 

the lower the energy the higher the probability. Thus if AU < 0, the flipped state is accepted 

with probability 1. We treat the case AU > probabilistically, that is, with the probability 

to accept the flipped state as p(AU) = exp(-AU). These steps can be summarized as that 

we flip current state to its opposite rather than remaining the current state with probability 

min(l,exp(-AC/)) . (14) 

The detailed balance maintains and this MH updating is used in the MCMC Ising model 



sampling (Newman and Barkema, 1999 Nott and Green, 2004). In this paper, unless oth- 



erwise specified, we adopt this one step MH updating ( 14 ) with other Gibbs samplers in all 
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a) b) c) 

10 10 





Figure 2: Diagram of the cluster algorithm. Forming the cluster (a-c). Flipping clustered 
nodes (d-e). 



cases. Indeed, it is the antithetic updating method discussed by Nott and Green (2004) since 



exp(— AU) is the odds of flipping current state with the Gibbs type proposal. 



4.2 Cluster Algorithm 



Beyond the single site algorithm, the cluster algorithm is well established for simulating 



model rt9J) when J^-'s and h/s are fixed. There are abundant literatures available about 



the cluster algorithm in Ising model (Newman and Barkema, 1999 Swendsen and Wang 



1987 Wolff, 1989), but it has been introduced to Bayesian variable selection recently only 



16 



(Nott and Green, 2004). In general, a cluster algorithm performances better than the single 
site updating when is fixed. However, as pointed before, the model ([£]) is difficult in 
applying the clustering-updating algorithm since there is a random external field h* and the 
coupling coefficients Jy-'s follow some unknown distribution and are not independent. Plus 
the nodes are connected with each other by so called long-range interaction thus the system 
is a totally disordered complete graph. In this paper, we propose a generalized single-cluster 
Monte Carlo algorithm which is closel to Wolff's clustering scheme but capable to handle 
the situation with long-rang random interaction and external field. 

In the original SW and Wolff algorithm, clusters are formed through the bonding be- 



tween paired nodes with positive interactions. Although Nott and Green (2004) proposed an 
auxiliary variable technique to count the negative coupling between nodes and form clusters 
including anti-aligned nodes, their method is still based on the single bond between two 
nodes, which means whether adding a new node to the cluster is determined by the interac- 
tion between the new node and ONE node in the cluster. Unlike the usual Ising model on 
one dimension chain or two/three dimension lattice, the complete graph model of the binary 
random process is fully connected. This indicates each single node behaves according to the 
overall effects of all other nodes. Therefore, the clustering dynamics must incorporate this 
consideration. In other words, the growth of a cluster (adding one new node to the existing 
cluster) should consider the coupling between the new node and all nodes in the cluster. 

Before introduce the cluster algorithm, we specify two types of clusters since the cluster 
is formed according to the coupling coefficient Jij which can be either positive or negative. 

• a cluster with nodes aligned. 

• a cluster with nodes aligned and anti-aligned. 

We use c to denote the cluster, and c as the complement of c. The single node is considered 
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as special case of the cluster with aligned nodes. So within the second type of cluster, there 
are two sub clusters anti-aligned to each other. We denote these two sub clusters as c\ and 
Co with 7 Ci = 1 and 7 CQ = respectively. 

The question then is, given a particularly defined probability p a of adding a node to 
the cluster, what is the acceptance ratio that make the flip of the cluster satisfies detailed 
balance, and how to choose p a such that the average acceptance ratio is as large as possible? 
So we derived following generalized Wolff algorithm based on these considerations. 

1. Form the cluster. 

(a) Initialize the cluster set c by randomly picking a seed node. 

(b) Examine the nodes in c one by one, add the node j in c to the cluster with the 
probability 



P. 



max < 1 — exp 



*(-W £ J * - £ J * ) A 

Vfceci leco / J J 



(15) 



and remove j from c if j added to c, where < A < 1. Continue iteratively until 
no new sites added when each nodes in c has been examined. 



2. Flip the nodes in cluster c with probability 



mm < exp 



;i - a) £(-ip £ J jk - £ j jt )+ £ h* - ]T h* 



min {exp [(1 - A)(1 T J CQC - - l T J ClC -)(2 7c - - 1) + l T h; - l T h:J , 1} 



3. Flip the rest nodes in c (if any left) by single updating method (14). 



(16) 



4. Update {3/s, t/s and <f>. 



In (16), the last expression is for the convenience of coding using matrix expressions. As 



we can see, parameter A plays a role of partial clustering similar to Higdon (1998). When 



A = 1, all interaction terms in (16) are annihilated, which means the coupling of the cluster 
with its neighbors are totally decoupled. If A = 0, then no clustering process, the algorithm 
is reduced to single site algorithm. 

The cluster algorithm can be better explained using the diagram in Figure [2] Figure 
[2] (a-c) demonstrate the clustering process. First we randomly select a seed node, in this 
diagram, node 8. Then we throw the bond to all neighbors of node 8, and find node 5 is 
bonded to 8 with probability p a ^ and forms the cluster (the dashed line is turned into solid 
lines, meaning 5 is added to the cluster). We scan the remaining nodes again but whether 
or not a new node should be added is determined by the bonding between the new node 
and node 5 and 8. For example in Figure [2] (b), the bond between the new node 4 and the 
cluster is the overall bonds 4-5 and 4-8. In Figure [2] (c), after add the last new node 1 into 
the cluster, we scan all the left nodes and find no new node added to the cluster, then the 
clustering process stops. 

The flipping of the cluster is demonstrated in Figure |2](d-e). The cluster formed contains 
nodes c = {8,5,4,2, 1}. To flip these nodes, we have to cut off the bonding of the cluster 
with all other nodes in c because in a complete graph the neighbors of a cluster is all other 
nodes outside of the cluster. For example, the bond between the cluster and node 10 is 
demonstrated in Figure [2] (d), where we can see the bonds between 10 and all nodes in the 
cluster should be cut off to flip the cluster. Thus to completely flip the cluster, the bonds 
between all other nodes in c and the nodes in c should be cut off. Similarly, in the reverse 
process to flip the cluster back, as shown in Figure [2] (e), all the bonds between the cluster 
c and the nodes in c must be cut off. 

It is easy to show that our algorithm is more general in sense that it is applicable to the 
complete graph with random interaction. When applied to Ising model on grid with posi- 
tive fixed interaction J (only interactions among nearest neighbors account), our algorithm 
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evaluate to the original Wolff algorithm: the cluster growth by throwing bonds to nearest 
neighbors with probability 1 — exp(— J). 

Following theorem shows the algorithm stated above satisfies the detailed balance and 
ergodicity. 

Theorem 1: With the probability of adding node to the cluster, p a ,j, and the probability of 
moving from current configuration 7° to the flipped configuration 7*, 0(7° — > 7*), as defined 
as in the generalized Wolff algorithm, the algorithm is detailed balanced and ergodic. 



Proof. See |A.l 



In this paper, we mainly focus on the noninformative prior for 7. However, since the 
distribution of 7 given (3 and (f) follows the Boltzman distribution, it is nature to assign 
a Boltzman prior or Ising prior for 7 if such priori information is available. For example, 
in some genetic data, the genes form a network that can be descried using special graph 
model, with this information we can assign a Ising prior with specific interaction matrix 
that represents the priori graph structure. We will discuss this issue in Section |6j Another 
advantage of the cluster algorithm is it reveals the latent graph structure according to the 
frequencies of nodes that form a cluster, and this information may help us to distinguish the 
signals and the noise since the signals and noise should have high frequency to be anti-aligned. 



5 Understanding the Mechanism of Bayesian Variable 
Selection 

The purpose of this section is to understand how the marginal probability p{^j\y) evaluates 
under different choice of marginal prior of (3 given the only tuning parameter b. Although 
our Ising model is based on the KM model (jl]), the results of this section is also applicable 
to SSVS model with the point mass mixture prior for (3. This is because if we integrate 
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out f3, both SSVS and KM models are identical. Note that all the results in this section is 
based on parametric linear model ([T]) where f3j is scalar, but the major results are similar to 
nonparametric linear model where /3„- is Mj x 1 vector. 

Some notations are introduced here. Since x/s are standarzed, C = X T X = [c±, c p ] is 
the correlation matrix of x^-'s and Cj's stands for the vector of the correlation between Xj with 
all predictors. For orthogonal data set, C = I n or xfx^ = 5ij and xje = 0, i, j = 1, ...,p. The 
projection of y on x^- can be expressed as a = X T y = (x^y, xJy) T = (ai, a p ) T , which 
are the estimation of the signal f3/s under orthogonal design. We may also need notation 
c T , = 7j ° { c j}j> where "o" stands for the pointwise product of two vectors, "j" stands for 
the jth element removed for corresponding vectors and matrices, 

5.1 General Profile of the Marginal Selection Probability 

With the hierarchical model defined Q and ^ in Section [2j the posterior distribution of (3 
is multivariate normal given 7 with mean and variance S as. 

fi = 0SX^y; E = (0X^X 7 + D)'\ 

where D is a diagonal matrix with diagonal element {p"} 1<J<p - To understand how Tj 



and b introduce the shrinkage effect, similar to Carvalho and Poison (2010); Poison and 



Scott 



(2011), it is convenient to introduce the shrinkage coefficient, Kj = {j^j / + w^j- 
Under the orthogonal design, the posterior mean and variance of /3j's corresponding to 
3 e {j ■ 7,- = 1} are 

S(/9 i |y,0,6) = [l-E(« i |y,0,6)]a i , 

(17) 

V^ar(/3 J |y,0,6) = [l- J E;(^|y,0,6)]0- 1 . 
The coefficient Kj's represent how much shrinkage being placed on the initial estimation 
of /3j's. Kj — » 0, yields no shrinkage, and Kj — > 1 yields near-total shrinkage. With this 
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definition of Kj, it is easy to derive the density function of Kj, p(kj). Table [T] lists p(«j)'s 
based on the three prior settings given 0=1. 

In order to compare the performance of different variance mixture priors p{Tj) on the 
marginal selection probability and avoid notation abuses, it is convenient to assume fixed 
and use a scaleless transformation \/~4>y —> y such that ajy/cj) —> a,j, by/4> — » b and y/(j)C — > C . 
This is equivalent to assume 0=1, but keep in mind that a/s, Cj-'s and b are scaled by \/0 
unless stated otherwise. 

o 



(X) 

o 



CO 

o 



d 

CM 
O 



O 
O 

0.0 0.2 0.4 0.6 0.8 1.0 1e-03 1e-01 1e+01 1e+03 1e-03 1e-01 1e+01 1e+03 

K i b b 

Figure 3: The curves of selection probability against kj (a). The curves of marginal selection 
probability against global shrinkage parameter b (b). Marginal selection probabilities with 
baseline subtracted (c). All plots are under orthogonal designs. 




With these coefficients defined, following theorems connect the marginal odds of 7j given 
the data with k/s and b. Based on iQ and (|2]), the join distribution for 7, (3, r given b is 

P(l, A r|y, b) oc p(y (7, /3)p((3\r, b)p(r), 

and the marginal probability for jj is p(7j|y, b) = J J2-y, P(lji 7j> r ly' b)df3dr. Thus the 
marginal odds for 7^ = 1 given 6 is 

b JJ2^P(lj = 1 n-j,P,T\y,b)d(3dT 

7T ■ — 
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Theorem 2: For the Bayesian model defined in |7]) and (|l|) ; the marginal odds of jj, 
defined as 7i b j; has following form 
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where p(kj) is the density function of Kj, 



■Rj = k exp 



Ml - K 



■3/ 



and £j is a positive real function of Kj 



1. For qeneral cases 



C(^/-''/-7,-T. ; ) = exp 



with <> ; = [D 3 + A^X,. - (1 - ^(c^c* )] 
For orthogonal designs, £j = 1, and 



I E 75 £(7j = 0, « is 7j, T 5 )p(rj)drj ' 



- (a, - (1 - kJOojCtj) 1 ( a j - (! - >% c t 



T M-l 



7T"j = I 7Tjp(Kj)dKj. 



(19) 



(20) 



i%r /2 p# /2 



(21) 



For orthogonal designs, if Kj — >■ ; £/ien 7Tj — >■ ; and i/Kj — >■ 1, i/ien 7Tj — )■ 1. Similarly, 
if b — > 0, then i\ h j — > 1, and if b — >■ 00, i/ien — > 0. 



Proo/: See A.2 



From Theorem 2 we can see that in general the marginal odds rf^f 7Cjp(Kj)dKj, the 



marginal odds of the orthogonal design. According to equation (18), when the correlation 



among predictors are not negligible, the odds will be "blurred" by the coefficient and the 
marginal selection probability is blurred too. Basically, £j is a complex function of aj, cj, and 
Tfc/6 2 , k 7^ j or Kj. Furthermore, it is infeasible to calculate £j given large p with more than 
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2 predictors are correlated. However, we can focus on the orthogonal design to understand 
the mechanism of marginal selection probability in general since it is much more easier to 
calculate. 

Combining Theorem 2 and Figure [3J we can understand the behaviors of ttj and tc^ better. 



Figure [3] (a) plots the selection probability as a function of Kj according to odds ttj (19), 
and Figure (b) plots the marginal selection probabilities according to tt^ with different 



prior p(KjYs. We can see that for the orthogonal design, expression (19) and (21) indicate 
that 7ij and tt^ are monotone functions of cij, this is demonstrated as the different selection 
probability curves in Figure [3] too. In ideal case, all noise predictors will demonstrate the 
same selection probability since a,j — 0, which defines the baseline selection probability curve 
in Figure [3] (a-b). 

Thus ideally, any signals with cij ^ are deviated from the baseline curve. However, 
when correlations among variables do not equal to zero, the situations become complicated. 
First, even though the correlation among variables are small enough so ttj and TTj are still 
monotone function of cij, the baseline will be blurred and extended to a band. To see this, 
consider k e V* and j G V* where V* is the set of true nodes and its complement is V*. 
Because Ckj = x^x^ ^ 0, x^. will have fake signal: = x^y = CkjCij. Thus all the noise 
predictors will demonstrate false signals as long as they have nonzero correlations with the 
true signals. This makes separating the true variable with small signals from the noise 
difficult. Secondly, because of £j, even for large signals the selection probability will be 
distorted by their correlated fake signals. For example in Figure [3] (b), if aj = 2 is the fake 
signal and cij = 4 is the true signal and they are correlated, then the profile curve of a,j = 2 
and a,j = 4 will show some "interacting" behavior at b « 1 where the selection probability 
of fake signal reaches the maximum (we will show this behavior in the simulation analysis). 
Thus in general the largest gap that separates aj = 2 and dj = 4 is not around b ~ 1, but in 
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two regions around b ~ 0.1 and b f» 1000. 

Furthermore, the second result of Theorem 2 states some asymptotic behaviors of ttj 
and 7T^ as Kj — > or b — > oo and «j — >■ 1 or b — > 0. This can be clearly seen in Figure [3] 
(a-b) where with small shrinkage (kj — > or b — > 00), both irj and 7r* approach 0, and with 
large shrinkage (kj — > 1 or b — > 0), they approach 0.5. However, the dropping rate depends 
on the magnitude of the signal and the prior For example in Figure [3] (a) we can 

see for large signal aj = 4, the selection probability maintains at 1 for Kj — > till the last 
point. In Figure |3] (b), furthermore, we can see the selection probability curves are different 
for different priors: some drop very fast, such as Laplace prior, some are pretty robust to 
shrinkage such as horseshoe prior. 

So choosing an appropriate prior for Tj or p(nj) is important. Our next question will 
be how to choose an appropriate prior. Based on Figure [3] and Theorem 2, there are some 
guidelines to choose the prior p(tj): (1) The rate of tcj to increase must be fast when the 
signal increases, so that the large true signal can be separated from the noise more easily. 
(2) 7Tj drops to 0.5 or slowly when b — > or b — > 00 so we have a wider windows of b where 
the true signals maintain high selection probability. 

Following theorems will further help us to understand the relationship between iij and 
the shrinkage coefficient Kj. 

Theorem 3 For the Bayesian model |7p and with orthogonal design, suppose prior 
p((3j) is a zero mean scale mixture of normals: [/%|tj,&] ~ ^(O,^ 2 ^ 1 ), with Tj having proper 
prior p(rj). Define the marginal density rrij = m(y\'jj = 1) as 

If rrij is finite for all y, then 
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1. 



Id d 

E(Pi |y> 7j = 1) = % + — xj— mj = 0i + xj— log mj-. (22) 



£(/3,|y, 7j = 1) = A log7r & = [l_£(«.|y )Tj = i)]a y (23) 



Proof: See |A.3 



The first result of Theorem 3 is well known in Bayesian literature and can be found in 



Pericchi and Smith (1992) and Poison and Scott (2010) for more discussion, here we just 



simply extend it to the linear model case. We are more interested in the second result which 
gives the relationship among the expectation of (3j, the derivative of log 7T* respect to Oj, 
and the shrinkage coefficient. Since we prefer a larger derivative of log marginal odds such 



that the large signal can be separate from the baseline further. (23 ) indicates that to achieve 
this purpose, it not only requires a large dj, but also requires the expectation of shrinkage 
parameter Kj to be small. This is confirmed by Figure [3] (a), where we see that the largest 
separation between the signals and the baseline is on the side of Kj — > 0. Thus if integrate 
out Kj to have n b ^ we want the density p(Kj) has substantial mass around the region with 
largest separation. However, we don't want Kj = since it means exactly no shrinkage and 
all 7Tj's drop to zero at this point. 

Therefore, the general requirement for a based on Theorem 2, 3 and Figure [3] is 



to maintain substantial mass around region on the small shrinkage. Surprisedly, (23) seems 
contradict to the usual variable selection strategy that to recover the sparsity in the region 
of large shrinkage. In fact, it is possible to separate the signal and noise in large shrinkage 
region and large shrinkage does have some advantages, such as stability, faster mixing, less 
sensitive to nodes number p and so on. So it is a second choice as long as the signals are 
robust to large shrinkage, at least for large signals. However, in this paper we focus on the 
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small shrinkage region where the consistency in variable selection seems satisfied more often. 



5.2 Dynamic Properties of the Odds with Different Priors 

To explain the different behaviors of the selection probability caused by different priors, we 



need examine more details about the density distribution of Carvalho and Poison 



(2010) and Poison and Scott (2010) discussed the performance of different types of priors in 
the Bayesian regularization with difference weight on shrinkage. They focus on the effects 
on the estimation of the signals. We are looking at those priors from a different point of 
view in terms of variable selection based on the selection probabilities. 

In the prior for /3/s, b is the global parameter. As b — > 0, large global shrinkage is applied 
on all /3j's, and as b — >■ oo, the global shrinkage effect will be negligible. Table [T] lists the 
prior p(Tj)'s, and corresponding p(«y)'s as well as the marginal prior p(f3j\b)s. Because of the 
existing of b, how much weight is put on the shrinkage is modified, and for different priors 
this modification is different. 

To see this, Figure [i] compares density function p(/3j|6)'s around zero point and on the 
tails, and density function p(/tj)'s given different fe's. By examining p(Kj)'s in Figure [1] 
together with Figure [3] (c) , we can understand how b effects the selection probability profile 
through putting different weight on shrinkage. Figure [3] (c) plots the selection probability 
profile with the baseline subtracted. It can be seen for orthogonal design, the larger the 
magnitude of the selection probability, the larger the true signal distinguished from the 
baseline. In small b (b < 1) region, the descendant order of the magnitude is Cauchy, 
horseshoe and Laplace prior for a given signal, which is consistent with the p(Kj) plots in 
Figure [4] at b = 0.1 where the order of density mass on the small Kj region is Cauchy, 
horseshoe and Laplace prior. In addition, since Cauchy and horseshoe priors put similar 
mass around small Kj side, their selection probabilities behave almost identically for small 
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Figure 4: Marginal prior density functions of /3j and density functions of Kj for different b. 



b as shown in Figure [3] (c). On the other hand, this order changes for b = 100 where it 
becomes horseshoe, Laplace and Cauchy prior in Figure |4j Again this is consistent with the 
selection probability order in Figure [3] (c) for large b (b > 100). The reason that Cauchy 
prior becomes worse for large b is because all the mass of p(kj) is absorbed to Kj = which 
is not we expect as mentioned before. For a moderate b, such as b = 1, all priors have 
substantial mass around small Hj side as shown in Figure |4j thus all behave similarly. This 
is confirmed by Figure [3] (c) where the selection probabilities for different prior seems similar 
around b = 1 at least for large signals. 
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Above analysis also shows that more weight on large shrinkage is not as important as on 
small shrinkage in terms of distinguishing the signals. Therefore, horseshoe prior is superior 
to other two, even though p(/3j\b) of horseshoe prior does not have long tail as much as 
Cauchy prior for small b. Horseshoe prior does demonstrate that for a wide range of b 
maintains substantial mass on the small shrinkage side of Kj. On the other hand, Laplace 
prior has almost zero mass around small Kj side when b is small, and Cauchy prior has 
all mass abosorbed to Kj = when b is very large, each deteriorates their performance 
for those b values respectively. Our argument to evaluate the priors is thus different from 



Carvalho and Poison (2010) where they argue that the horseshoe prior is superior because it 
has substantial mass on both small shrinkage and large shrinkage in terms of estimation. Of 
course, although we prefer small shrinkage in terms of variable selection, large shrinkage does 
have advantages that some times we must consider. For example, we found in the simulation 
that with large shrinkage the Gibbs sampler can converge faster even with very large p. 

To further examine the dynamics of the selection probability profile, following theorem 
gives some asymptotic behaviors about the derivation of log ttj respect to \aj\ and b, and so 
it helps us evaluate different priors. 

Theorem 4-' Consider the inverse of Tj, <r| = r~ , has prior density, p(cr|), as a 2 — > oo, 

for some slowly varying function L(x) such that as x —> oo for all t > 0, L(tx) / L(x) — > 1 , 
then 



1. as dj — > oo 



d u hi + f=r if A = 

— 7— — r log 7T^ ~ \ ' " K l (24) 

d\^\ \ K-l + g-^A if A>0 . 
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2. For large a,-, as b — > 



d, ,1 slogtf (a?) </ A = 




* " ' ^M + £logL 6 (M) •/ A>0. 



where L (x) is a function conditioning on b (see A. 4) 



Proof. See |A.4 



Particulary, a L 6 (:r) has forms of frexp ^— ,6 2 , and for Cauchy, Laplace and 

horseshoe prior respectively. When \aj \ is large, as Theorem 4 assumes, 4 logL fc is negligible. 



Theorem 4 is similar to the tail robustness theorem discussed by Poison and Scott (2011 ) 
about marginal density m(y\ r yj = 1), which implies that the shrinkage will vanish for any 
scale mixture normals with with heavier tails (such as Cauchy and horseshoe prior), 

while remain non-diminishing for p(a 2 ) with exponential tails (such as Laplace prior). Com- 



bining with (22) of Theorem 3, we get the similar conclusion about the estimation of (3j that 
it is robust if estimated by long tail priors. Similar robustness can be found for ttj. The ro- 
bustness of means fast change rate of %j as signal magnitude increases and small change 
rate of 7Tj as shrinkage increases, which are important since these two characteristics can 
make distinguishing the signals easier. Large ^-y log 7rj helps to distinguish the signals from 
the baseline, while small j| log Hj leads to a wide window of b where the selection probability 
of true signals remain highly. 



Expression (24) indicates that for priors with exponential tails (A > 0), the selection prob- 
ability 7r^ increase with a smaller rate when the signal magnitude increases comparing with 



the heavier tail priors. Meanwhile, expression (25) shows that for priors with exponential 
tails, iTj drops with much faster rate (~ \/2X/b 2 ) as b — » 0. We can also compare the drop- 
ping rates of log7r^ of Cauchy and horseshoe prior as b — > 0. Since L b (x) = exp ^— b and 
(1 + b 2 /x 2 )~ 1 b for Cauchy and horseshoe prior respectively, it turns out d log L b (x 2 )/db w 1/b 
as b — > for both priors. This means the dropping rate as b — > is similar for Cauchy prior 
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and horseshoe prior which is confirmed in Figure [3] (b). Note the second conclusion of The- 
orem 4 does not apply to b — > oo unless |oj-| — > oo faster than b — > oo, i.e., &/l a il — > so 
L b maintains as slowly varying function. However, as shown by the exactly calculation in 
Figure [3] (b), horseshoe prior is also the most robust one as b — > oo. 

Figure JHJ gives the exact calculation of r a = ^-y log tt* and = ^ log ttj for three priors. 
In Figure [5] (a) we can see that r a is the nearly the same for three priors at large b. However, 
when b is small, r a is reduced by certain value for Laplace prior, meanwhile, it remains the 
same for Cauchy and horseshoe prior. So the exact calculation just confirms Theorem 4. 
Similarly, the exact calculation also confirmes the result of Theorem 4 about r&. As we can 
see in Figure [5] (b), increases exponentially as b — > for Laplace prior, which means as 
6—7-0, the selection probability by Laplace prior will exponentially drop to 0.5, and this 
behavior has already been observed in Figure [3] (b). 



Based on discussion in Section |5.1| and this section, horseshoe prior performs the best in 
terms of the marginal selection probability, Cauchy prior is in the second place, and Laplace 
prior is the worst since the selection probability drops too fast as shrinkage increases. 

5.3 Some Expressions for Ti b - 

In above sections, we discussed the properties of the odds 7r^ for orthogonal designs, but 
did not show how to calculate ttj. Those curves are calculated by Monte Carlo simulations 
which are very precise. In some cases, we may want to calculate ir^ directly. There is no 
closed form of 7r^ for the three different priors. However, we can see at least for Laplace and 
horseshoe priors, iij can be expressed by some special functions. 
For Laplace prior, 



A = vl 6 lexp 



Oil -b 



-1\2 



{$(h| - ft -1 ) +exp(2|a i |r 1 )$(-|a J | -r 1 )}, (26) 



2 

where $(•) is the CDF of standardized normal distribution. This expression can be directly 

31 




X) 



CT> 
O 




a,=20 b) 



100 500 1 5 20 100 500 1 5 20 100 500 



024602460246 



Figure 5: The derivative of log odds respect to \a,j\ against given different b (a). Note the 
curves of Cauchy and horseshoe priors are overlapped. The derivative of log odds respect to 
b given different aj (b). 

used to calculated the marginal selection probability given b. 
For horseshoe prior, 

l,?,5,l-6- 2 V (27) 



7T; 



where Be(- ■ •) denotes the beta function, and $i( 



is the degenerate hypergeometric 



function of two variables (Gordy, 1998; Poison and Scott, 2010). The calculation of $i can 



be employed by using a series of hypergeometric 2 -Pi functions (Gordy| 1998|). 
The derivative of above expressions is shown in 



A.5 



ttj of Cauchy prior does not have 



an analytic form, and its can not be represented by known special functions neither. Hence 



we simply use the Monte Carlo approach to calculate 7r* for Cauchy prior. 



5.4 Simulated Tempering and Generalization by Levy Process 

Li and Zhang (2010) discussed the difficulty of sampling around phase transition in a SSVS 
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model by assigning a Ising prior for -y. The difficulty is, given a Ising model there is a 
threshold for the interaction strength, when the interaction magnitude is larger than this 
threshold, the MCMC sampling will dramatically slow down, resulting in either overwhelming 
many selected nodes or extremely few ones. It becomes even worse when Jy's and h/s 
are all random, such as our model. However, the family of exchange Monte Carlo and 



simulated tempering algorithm has be developed to handle the slow mixing problem ( Geyer 



and Thompson, 1995 Iba, 2001 Lyubartsev et al. , 1992). By introducing the scale normal 



mixture for (3, our model is an special simulated tempering algorithm which thus improves 
the mixing issue too. 

To understand the simulated tempering algorithm, consider the usual Ising model with 
U(*y, J) = X^*<j Jij^ij (f° r simplicity no external field h* included), then the Boltzman 
distribution is expressed as 

Ph\T, J) = ^f- ) <*P [-T-ty (7, J)] , 

where T represents the temperature (or the scale of variation), and is random and follows 
some distribution p(Jij) such as standard Gaussian distribution. When T — > 0, the effective 
interaction = T" 1 — > 00. Thus if T is lower than some critical temperature, the strong 
interaction will lead to some non-ergodic behavior such as the slow down of the MCMC and 
extremely large proportion of jj = 1. The reason for this is because the low temperature 
phase of disordered Ising model generally has numerous local minima which are separated 
to each other by energy barriers. The characteristic time in which the system escapes from 
a local minimum, however, increases rapidly as the temperature decreases or the interaction 



increases. A good review can be found at Newman and Barkema (1999) about this issue. 



The family of tempering algorithm treats temperature T as a dynamical variable (Lyubartsev 
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et al. 1992), and the joint distribution p(7,T) is represented as 

p(l,T, J) ocp(j\T, J)Y\_p{Jij)p{T), (28) 

i<j 

where p{T) is the distribution of T. The prior information for the T thus represents the range 
and mass of the temperature to sample the MCMC. With some variable transformation by 
replace T^ 1 ^ 2 Jij with JL-, where T 2 = T b , the joint distribution (28) then becomes 



p( 7 ,T 6 , J) ex p( 7 | J) n P (J ii |T 6 )T- 1/2 p(T 6 



1/2 j 



where p(-y| J) oc exp — . , p( |Tb) oc p(T fe Jij)T b (with some notation abuse, the 



later p(-) represents the same density function of p( .%))■ Clearly, TJ, is a global temperature 
parameter here. If we introduce the local temperature parameter T b for each interaction J^, 
then the marginal prior for is 



POO 

p(Jij) oc / p(Jij\T b )T b ~ ll2 p(T b )dT b 
J o 



If p(Jij) ~ -/V(0, 1), above posterior for jjj is a normal scale-mixture whose mixing measure is 
expressible in terms of the density of the subprdinator T&. Hence according to the Theorem 



3 of Poison and Scott (2011), with the simulated tempering algorithm the interaction of the 



random Ising model (28) can be expressed as a Levy process mixture scaled by T&, and T b is 
a nondecreasing pure-jump Levy process with marginal density p(T b ) at time b. 

As another algorithm in the same family, the exchange monte carlo algorithms (or parallel 
tempering) is to simultaneously and independently simulate K > 2 replicas of the MCMC 
trace under different temperatures, and exchange the 7 configurations of the replicas with 
certain acceptance probability by referring to the energy cost AU. The analogy between the 
exchanged monte carlo and simulated tempering algorithm is clear in terms of the mixture 
distribution of the interaction J^. For simulated tempering the mixture weight is the con- 
tinuous prior p(T b ) while the exchanged monte carlo is mixed with weight on a set of discrete 
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temperatures. In both algorithms, the low temperature process can access a representative 
set of local energy minimums with the accompany of the high temperature process which 
are generally able to sample large volumes of configuration space to keep the configuration 
from trapping in some local minimum. 

We see that how to understand the simulated tempering algorithm as the Ising model 
with normal scale-mixture prior mixed by the Levy process. On the other hand, BVGM 
with Levy process mixtures can also be understood as an Ising model sampled by simulated 
tempering algorithm. To see this, we can generalize both Cauchy and Laplace prior into the 
framework of normal/generalized inverse Gaussian mixture. The marginal prior of /3j for 
both priors can be expressed using one formula 



oo 



p(pj\u,v,w)= / p(Pj\Tj)T j 1 g(T j )dT j = / p{P j \T j )p{T j \u,v,w)dr : 



o Jo 



I 



' exp(-^r i /2)r^ 5 ^rr 1 exp[-(«S + ^r i - 1 )/2]dr, 



(1 v 2tt r\ r 3 j, , j 2K w (uv 

y jw K w+ i/ 2 (vyfijf+V?) 



1 {u/i 



V2i KJuv) (^^/p' 2 ' 

(29) 

where K w (x) denotes the modified Bessel function of the third kind with w. p(rj\u,v,w) 
is the generalized inverse Gaussian (GING(w,u,v)) distribution with parameters u, v and 
w such that w G K. while u and v are both nonnegative and not simultaneously 0. Note 
for the two special cases we adopted in this paper, Cauchy and Laplace prior, the values 
for these there parameters are on the boundary. However, it turns out both the prior 
p(Tj\u,v,w) and the marginal prior p((3j\u,v,w) as the limit exist. For Cauchy prior, 
w = 1/2, u = b, and v — 0, thus p(Tj\b, v, 1/2) ->■ G(l/2,6 2 /2) as v ->■ 0, where we 
use identity K_i/ 2 (x) = ^n/2x~ l l 2 exp(— x). Hence the distribution of /3j reduces to the 
Cauchy distribution with scale parameter b, where we use \im x ^ x w K w (x) — > 2 w ~ 1 T(w) and 
r(w) is a Gamma function. For Laplace prior, w — — 1, u — 0, and v = b^ 1 . the limit 
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of prior p(tj\u, b 1 , —1) — > IG[1, {2b 2 ) x ] as u — > 0, where IG(-) stands for inverse gamma 
distribution, and we used the index symmetry K- W (x) = K w (x). 



According to Theorem 3 of Poison and Scott (2011 ), we see r- g(rj) = p(rj\u, v, w) where 



g{rj) is the density of the subordinator Tj at (u,v, w). Analogous to the discussion with the 
simulated tempering algorithm for Ising model, we can see the temperature parameter in our 
model is Tj. However, there are several differences, such as in our model, Jy oc fyflj and we 
assign normal mixture prior for /3j's, while in the simulated tempering algorithm of regular 
Ising model, the prior is assigned to Jjj directly. 

Now we can understand the temperature effect of Tj. When Tj — > oo, which is equivalent 
to Kj — > 1, the system is in a high temperature state. This means MCMC is exploring 
the whole configuration space, and no precise sampling for a local energy minimum. This 
is also equivalent to say for each node, the odds of 7j = 1 is equal to one since in high 
temperature every node is heated up and chance to be up and down is even, which means 
the marginal selection probability for all nodes is 1/2 as shown in Figure [3] (a) as Kj — > 1. 
On the other hand, when Tj — > 0, which is equivalent to Kj — > 0, the system is in low 
temperature state. Starting from some initial state, all nodes configure will be trapped into 
their energy minimum (local maximum likelihood) which is jj = for most nodes in the 
orthogonal design, unless the external field hj oc cij is strong enough to force the node in the 
state jj = 1. This is why we see in Figure [3] (a) for small Kj the selection probability is for 
nodes with small cij and remains 1 for nodes with very large aj. 

It is easy to understand the role of b too, which is opposite to Tj if we look at the prior 
p{(3j\Tj,b) ~ iV(0, b 2 /rj). In fact, we can also understand b by representing the hierarchical 
model as p(-f\y,(3) Y[p{Pj\rj)p{T j \b), wherep(rj|6) = rj 1/2 exp(-b 2 T j /2) 1 rj 2 exp[-(26 2 r J ) _1 ] 
and (Tj + b~ 2 )~ 1 for Cauchy, Laplace and horseshoe prior respectively. Thus we can see that 
b controls how the local temperature parameter Tj distributes. Large b limits the variation 
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of Tj and increases the mass around zero, and small b means the range for Tj to vary is large. 
This is also consistent to Figure [3] (b), where we can see when b — > 0, Tj can vary widely, 
thus the system is in high temperature state, and if b — > oo, Tj will be limited around and 
the system is in low temperature state. 

The generalization of Cauchy prior and Laplace prior into the Levy process mixture 
not only shows that the connection between Bayesian variable selection and the Ising model 
with tempering algorithm, it also provides flexibility to choose priors with different shrinkage 
characteristics. We did not discuss generalization of horseshoe prior as a Levy process, further 



discussion can be found in Poison and Scott (2012), but similar conclusion can be drawn for 
horseshoe prior in terms of shrinkage or tempering. 



6 Incorporating Graph Prior Information 

In this paper, we mainly discuss model ((!]) as a graphical model with noninformative prior 
for 7, and it works well for n is large enough. However, the priori information about 7 
becomes important when n goes small. There are two purposes of incorporating graph 
prior information for 7. First, it helps to improve the mixing issue so the model works for 
n « p. Second, it improves the power of detecting the true signals. Since two connected 
nodes with positive interaction intend to be selected or excluded together, only the prior 
graph for 7 with positive interaction is meaningful. If we have the information that some 
selected nodes and their neighbors are all true nodes, then incorporating a graph prior 
with those nodes connected will improve the power to identify the nodes with small signal. 
This is because the prior tells us that those nodes with small signal have more chances 
to be selected together with their neighbors which are true signal. On the other hand, 
for those nodes that are not true signal, we have more chances to exclude their neighbors 
too since the prior tells us they should be excluded together. At first glance, assigning 
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a prior graph for 7 seems like manipulating the weight to select which nodes and their 
neighbors, but if the prior information is true, then assigning such a prior is reasonable. 
Even though the prior information is not exactly correct, it will help if the prior graph 
contains the true graph about which nodes are networked. For example, given a true model, 
y = J2jes*jPj where 5 = 1, ...,k is sequential index up to k and k < p. Obviously there 
are some information about the true variables such that there are k sequential nodes are 
true nodes, and p — k sequential nodes are not in the true model. Therefore, a Ising prior 
with one dimensional linear chain will be a very efficient prior since this prior reflects the 
information that sequential nodes are selected or excluded together. Another example is the 
genetic pathway data within which different sets of genes function together. Some gene sets 
are related to the phenotype diseases, some are not. Therefore the prior with this pathway 
graph helps distinguishing different set of genes in the pathway since among those genes 
if one node is selected then its connected neighbors have high chance to be selected too. 



Further example about incorporating prior graph information can be seen in Li and Zhang 



(2010 


); 


Monni and Li 


(2010 


); 


Stingo et al. 


(2011 


); 


Tai et al. 


(2010 



Since we are only interested in the network prior information, we only apply a graph 
prior for 7 with the interaction matrix W = {Wij} without the external field: 



p(-y) oc exp l^T Wijd. 



y 1 ' 

i<j 



where represents the prior coupling information between node % and j. For simplicity, 
considering W = wA, where w is a small positive interaction parameter, and A = {Ay} is 
the adjacency matrix with A^ = 1 if node % and j are connected and A^ = if i and j are 
independent. With this prior, the posterior distribution for 7 is modified as: 

p(i\y, A 4>) oc p(y|7, A <i>)p{i) 



K exp ( XJ J 'j <Sii + S h to ) 



(30) 
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where J*, = (Jjj + Wij). Jij and h* are defined in (10) and (11). 



Correspondingly, the two expressions for the cluster algorithm are modified as 



P. 



max si — exp 



VfcGci iSco / 



(31) 



«(7° -> 7* c ) 
= min < exp 



E(- X ) 7j B 1 - - B 1 - W + E ^ - E ^ 



(32) 



Above two expressions tell us that p a j and 0(7° — )■ 7*) are also conditional on A. 

7 Extension to Nonparametric Regression Models: 
Bayesian Sparse Additive Model (BSAM) 

Although the BVGM is based on the parametric linear regression model (JT[) , it is easy to be 
extended to nonparametric regression models. Some similar approaches have been suggested, 



such as nonparametric regression using Bayesian variable selection (Smith and Kohn, 1996) 



and Bayesian Smoothing Spline ANOVA models (Reich et al. 2009), both use the spline 



techniques. In the former case, the binary random variable is applied to each knots of spline 
function in stead of each predictor, thus the model is capable to select the knots of each 
nonparametric function. In the second paper, second order interactions are included by 
using function ANOVA. In this paper, we only employ BSAM to demonstrate how easy it is 
to extend the parametric regression model based on BVGM. 

Extending the multiple parametric linear regression model Q to an additive model is 
straightforward. In Bayesian point of view, there is no strict difference between parametric 
and nonparametric additive regression model in sense of that both assign prior to the basis 
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coefficients. In general, both choosing a basis to express the marginal regression predictor 
fj(pcj). For linear parametric regression, /j(xj) = /3jXj, where the predictor Xj itself can 
be considered as the basis to represent fj and /3j is a univariate random variable. This is 
just a special case of nonparametric regression model considering /j(xj) = Zjf3j, where Zj is 
some basis matrix for jth predictor and f3j is multivariate random variable, where the basis 
length Mj > 1 can vary for different predictor. Despite the variation of the basis chosen, 
each predictor is corresponding to a univariate random vector r 3 - = /(xj) = Zj(3j. Then the 
generalized additive model can be expressed as 

v 

y = ^ + ^ 7i / j (x j ) + e. 

For this model, similarly, we can consider following prior for /3 -'s 

[/3 i |r i ]-iV(0,6 2 Tr 1 /), 

(33) 

where N is multivariate Mj dimensional normal distribution, and p(rj) is some priors similar 
to previous discussions, such as (2(1/2, 1/2), 7(2(1, 1/2) or (2 + (0, 1). For some special basis, 
the multivariate normal prior of /3 • may have two variance components such as LS basis (see 



A.6). Note that because the dimension of /3 - is changed, if we integrate out r? by assigning 
the same p(Tj) as in parametric linear models, the marginal prior p(f3j\b) is no longer Cauchy, 
Laplace, or horseshoe prior any more, but it shares the similar properties as linear parametric 
case. 

Similarly we can define matrix R = [ri,...,r p ], design matrices Z = [Zi,...,Z p ], Z 7 = 
[71 Zi, 7pZ p ] and the coefficients vector (3 = (flj, 0E ) T , but here we should treat (3 and 
Z as blocks. The total dimension for the design matrix Z is n x M and M x lfor f3, where 
M = Mj. Without any confusion, we can use the same posterior distribution expressions 
in ([3| and ([5j to update f3 c and <p except we need keep in mind (3 and Z are in blocks and 
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D c is diagonal block matrix with block Tj/b 2 lMj,j € c in the diagonal, where Imj is Mj 
dimensional identity matrix. S c and /j, c are expressed as 

s c = {<pz?z c + D c y\ 

(34) 

M c = <PKZj (y - Z l5 (3^) . 
The calculation for Jy's and /ij's is exactly the same as (jloj) since those formulas involve R 



which is a n x p matrix for both cases. In A. 6 we will introduce a specific additive model 
with the natural cubic spline represented by Lancaster and Salkauskas (LS) basis. Of course, 
other spline basis to define Z is possible. 



8 Simulation Study 

8.1 Case One: Comparison of Three Priors 

The first simulation study will examine a simple linear regression model with a general form 

y = J2 Pj x i + e > (35) 
i&s 

where S = {2,3,5,10}, sample size n = 50 and p = 100, Xj ~ N(0,l),j = p and 
e ~ iV(0,l). Particularly, we will consider one large signal set and one small signal set: 
{/3 2 , p 3 , /3 B , M = {-4, 2, -1, 2.5} and {-0.9, 0.7, -0.6, 0.8}. 

In this simulation, we performed the single site updating with total 6000 iterations for 
each settings and discarded the first 2000 iterations as burn-in, then calculated the average 
7j's over total iV = 4000 iterations as the marginal selection probabilities. Figure [6] plots 
the marginal selection probability of all variables against the global shrinkage parameter 
b. For large signals, as shown in the upper row of Figure [6j horseshoe and Cauchy priors 
perform similarly and show the robustness of large signals, i.e., as b decreases, the selection 
probability of true signals maintains 1 till very small b and drops to 0.5. Horseshoe prior 
also shows better robustness than Cauchy prior on the large b side. Both priors have a wide 
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Figure 6: The profile curves of the selection probability of simulation model (35) with dif- 
ferent priors for large signal setting (a-c), and small signal setting (d-f). 



window of b in which the true signals are well separated from the noise signals. On the other 
hand, Laplace prior does not demonstrate such robustness for large signals: as the b — > 0, 
the selection probability of true signals drops to 0.5 very fast. Around b = 0.1, all signals 
reach the 0.5 line for Laplace prior. On the large b side, Laplace prior seems perform a little 
better than Cauchy prior. Recall the exact calculation of the marginal selection probabilities 
in Figure [3j we can see that the conclusion made from the simulation about the performance 
of the three priors is exactly the same. 

The bottom row of Figure [6] is the simulation results for small signals. In general, the 
window for true signals maintaining high selection probability gets narrower for all three 
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priors. The selection probability of true signal for all priors starts to drop to 0.5 around 
b = 0.1, and around b = 1000, they drop to 0. However, the drop rate on both side of b is 
different for three priors, resulting in different width of the working widow of b. Horseshoe 
prior has the widest window, Laplace prior gets the narrowest one. Again, the conclusion 
we made from the simulation is exactly the same as the calculation in Figure |3j 

Other observations can be found from Figure |6j especially for small signals. First we 
can see around b = 10, the selection probability of true signals first drops a little and then 
grows up again. This happens right above the peak of the selection probability of the noise, 
which indicates potential interaction between the noise and the true signals. This is easy to 
understand since when sample size is small, the correlation between true signal and noise 



is large, meaning the parameter £j in (18) is large such that the profile curve is distorted. 
The second observation from Figure [6] is, although the overall performance of three priors is 
different in sense of different width of the working window, we can select a right value of b so 
that for all priors the noise and signals are well distinguishable. For instance, for all priors, 
with a cut-off probability 0.5 all true signals are separated from the noise at some fixed b 
between 10 and 1000. 

8.2 Case Two: Three Regions of Global Shrinkage Parameter b 

Based on the case study one, horseshoe prior has the largest working window, thus in the 
rest of this paper, we employ horseshoe prior only unless stated otherwise. In this simulation 
we will examine the case when p is large, say p = 1000 or p = 500. The linear model still 



has the form (35) with Xj ~ N(0, 1), j = 1, ...,p and e ~ N(0, 1), but we consider following 



specific models and settings 

Model I A. p = 1000, n = 200, = 0.8 if j is odd; fa = 1.0 if j is even. S 
{31, 91, ...,931} U{60, 120, 960} 
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B. p = 1000, n = 500, ft = 0.8 if j is odd; ft = 1.0 if j is even. S 
{31, 91, ...,931} U{60, 120, 960} 

Model 11 A. p = 500, n = 100, ft = -0.8 if j is odd; ft = 0.8 if j is even. S 
{31, 91, ...,451} U{60, 120, 480} 

B. p = 500, n = 500, ft = -0.8 if j is odd; ft = 0.8 if j is even. S 
{31, 91, ...,451} U{60, 120, 480} 



Thus for Model I the number of true ft's are the cardinality \S\ = 32, and \S\ = 16 for 



Model II For each setting, we performed the single site updating with total 8000 iterations 
and discarded the first 3000 as burn-in, then calculated the average 7/s over total N = 5000 
iterations as the marginal selection probabilities. 

Figure [7] (a-b) and Figure [8] (a-b) plot the marginal selection probability of all variables 
against b for all settings, so we can have a overall view about all possible global shrinkage. 
In this simulation it is easier to examine how the working window of b suitable for variable 
selection changes. For example, in Figure [7] (a), the working window is from b ~ 0.01 to 
b ~ 2000, and from b w 0.001 to b ~ 3000 in Figure [7] (b). Within this window, we can see 
for both setting A and B, b can be further divided into three regions I, II and III. In Figure 
7j (a) Region I represents the high temperature or large shrinkage area, where b is around 0.1 
or smaller. Region II is a moderate shrinkage area with b between 1 and 100, and the last 
Region III is around 1000 varying from several hundreds to several thousands, the widths of 
these three regions also change for different signal strength. We can see in most cases, the 
signals are well separated from the noise in Region I and III. On the other hand, in Region 
II, if the signal is not large enough or the sample size n is small, some oscillations or strong 
interactions occur between signal and noise on the profile curves, resulting in a total mixture 
up of noise and signal. Thus if to suggest the appropriate value of b, it must be selected to 
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avoid Region II. 

Another interesting observation from this simulation is although in general, both Region 
I and III both can be used to detect signals, the performance of the MCMC sampling may 
have different properties in these two regions. We have discussed that Region I has large 
shrinkage property, but it may not have the sparse consistency. This can be understood from 
the point of view of the oracle properties of the estimation given the shrinkage parameter. 
Fan and Li 2001 shows in general the sparse consistency requires A — > or small shrinkage 
(A oc b~ 2 is the shrinkage parameter in their paper). Region III, representing small shrinkage 
area, hence may maintain sparse consistency while Region I loses it. This phenomena is 



shown in Figure [8] (a) for Model II A where the best value of b is in Region III with which 
all signals are distinguishable from the noise. On the other hand the noise and signals mix 
up in Region I. This can be seen more clearly in Figure [8] (c-d) for two specific b values: for 
b = 223, most of the true signals have selection probability 1 and distinguished from the 
noise, while for b = 0.23, some the true signals have smaller selection probability than some 
noise. 

How to determine the parameter b is a interesting topic. Some authors suggest assigning 



another prior for b, such as horseshoe prior Poison and Scott (2011). Unfortunately, because 



b is a global parameter, when p is large, the posterior distribution of b will be forced to 
some value that is not in Region III where we prefer. Therefore in this paper, we will not 
consider assigning a prior for b, instead, we consider it as a tuning parameter. A practical 
way to select b is to try several b values and choose the one we see the largest gap in selection 
probability and b is usually between ten to over thousands. 



45 



00 

o 



JD O 
O 



O O 

'o o 

a> 
<u 
CO 

co 

£ ° 

CO 

o 



o 
o 



a) n=200, p=1000 




I 


■ f 

II 


$#' \ 

m&, \ 
KB* 


; 


b) n=500, p=1000 






True Signals 

False Signals 




;: 



CO 

o 



3 ^ 

O 

O 



O O 

'o o 

CD 
CO 

CO ^-r 

£ ° 

o 



o 
o 



c) b=594 


n=200, p=1000 







o 


o 


o 

o 

Q 

o 




oo 

o Qd 
o 

OCt! 




d) b=0.09 


O 

o 









• True Signals 
o False Signals 




n=200, p= 


=1000 



1e-03 1e-01 1e+01 1e+03 
b 



200 400 600 800 1000 
Node Index 



Figure 7: The profile curves of the selection probability of Model I A and B (a-b). Selection 



probability at two b values for Model I A (c-d). 



8.3 Case Three: Comparison of Cluster and Single Site Algorithm 



As discussed in Section 5.4 assigning the scale normal mixture prior for /3j's with shrinkage 
parameter b is also a tempering algorithm, which means our model already makes improve- 
ment in the mixing issue. So there may be no much space left from improving the mixing 
with a cluster algorithm. We will show that, the performances of cluster and single site 
algorithms both are 6-dependent. In some region of b, one may outperform the other but 
performs worse in other region. 



To demonstrate this, we consider the simple simulation with the same model as (35) with 
large signals {(3 2 , /3 3 , /?5, /?io} = {—4, 2, —1, 2.5}, n = 200 and we vary p from 50 to 1500. We 
run the simulation with four representative b's, two are large and two are small, so we can 
compare the difference behavior of two algorithm with different shrinkage paramters. 

To measure the mixing or correlation time, it is convenient to define the "magnetization", 
which represents the average value of the binary random variable 7/s at ith sweep of 
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Figure 8: The profile curves of the selection probability of Model II A and B (a-b). Selection 



probability at two b values for Model II A (c-d). 



the MCMC iteration. 

v 



p — ' J 



Thus the mixing time of the MCMC iteration can be measured using the time-delayed 
autocorrelation function (ACF) of the Monte Carlo chain of "magnetization", 

= g(M(')-M)(M'' +i )-M) 

where t is the lag or the iteration time from the origin, measured in Monte Carlo sweeps 
(MCS), and M is the average magnetization over total N iterations. We assume the absolute 
value of C{t) decays exponentially, i.e., \C(t)\ ps C exp(— t/r), where C is some positive 
constant, and r is defined as the exponential correlation time. Therefore, we can use r to 
measure how fast the chain converges or mixes. The smaller the r, the faster the system 
mixes up. Another way to measure the mixing time is simply using the summation of the 
autocorrelation time, Y^t=o \C{t)\, where L is the maximum lag calculated. 
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Figure 9: The sum of absolute ACF against variable number p for cluster algorithm and 
single site algorithm at different b values. 



For each p we performed 15000 iterations or sweeps for each setting and discarded the 
first 5000. From the remaining N = 10000 sweeps we calculated the autocorrelation function 
C {t) up to L = 100 lags. Figure [9] shows the summation of absolute ACF time against the 
nodes size p for b = 0.03, 0.17, 1141 and 2195. 

From Figure [9] we can see the different behavior of cluster algorithm and single site 
algorithm. In large shrinkage region, b = 0.03 or 0.17, the cluster algorithm has mixing time 
uniformly smaller than the single site algorithm. Note that as the node size increases, the 
mixing time for all algorithm first decreases slightly and then stabilizes. It may goes up when 
p goes further. This profile is not well understood yet. Probably because in large shrinkage 
area, the effect of large node size is pressed by the shrinkage when the number of true nodes 
is fixed and small. Nevertheless, in this region, we can conclude that cluster algorithm is 
uniformly outperform the single site algorithm in terms of fast mixing time, and the mixing 
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time of cluster algorithm is at least two times shorter. 

In the small shrinkage region where b = 1141 and 2195, as shown in Figure [9] (b), we see 
different characteristics. First, the measured mixing time is much more noisy than in Figure 
[9] (a), but the trend against p is clear. Secondly, unlike large shrinkage area, here we see for 
both algorithms the mixing time increases as p increase. Furthermore, when p is small, the 
single site algorithm has shorter mixing time, but slow down very fast as p increases. For 
example, when p = 60, the summation of autocorrelation function is only several MCS, but 
reaches almost 100 MCS when p is large than 1500, which means extremely slowing down for 
the MCMC process. On the other hand, although the cluster algorithm is about two times 
slower when p is small and it also slows down with p increases, the mixing time increases 
with smaller rate and reaches no more than 50 when p = 1500. 

Hence in general, we can see cluster algorithm outperforms single site algorithm in terms 
of mixing time. However, which algorithm should be used depends on the data. Single site 
algorithm is much less time consuming since the cluster algorithm spends time in forming the 
cluster. The overall computational time for cluster algorithm is expensive when p > 1000. 
Plus, in many situations, the mixing time may not be so worse for single site algorithm. 
Thus we prefer using single site algorithm to achieve the results quickly. 

8.4 Case Three: Bayesian Sparse Additive Model 

In this section, we demonstrate variable selection on following Bayesian sparse additive 
model: 

V = /1O1) + fafa) + f a (x 3 ) + / 4 (x 4 ) + e, (36) 

where Xj = (wj + tu)/(l + t),j = 1, ...,p and w%, ...w p and u are iid from Uniform (0,1), and 
e ~ iV(0, 1.74). Therefore Corr(x i ,x i ) = t 2 /(l + t 2 ) for i ^ j. We consider t = and t = 1. 
The later one gives the correlation between two predictors around 0.5. This simulation is 
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similar to Example 1 in Lin and Zhang (2006) but with p = 10, 80 and 150. We also consider 



sample size n = 100. Functions //s have following forms. 

fi(x) = x, 
f 2 (x) = (2x - l) 2 , 
h( x ) — sin(27rx)/[2 — sin(27rx)], 

f 4 (x) = 0.1 sin(2vrx) + 0.2 cos(2vrx) + 0.3 sin 2 (27rx) + 0.4 cos 3 (2ttx) + 0.5 sin 3 (27rx). 

for each x,,-, the LS basis employs two precision parameters r ej - = cr~ 2 and 



(37) 



A.6 



As shown in 

Tdj = a d f. We treat all set of {T e j,Tdj : j = 1, ...p} independently. Similarly, we can still 
assign (2(1/2,1/2), 7(2(1,1/2), or (2 + (l) prior for them. However, since /3- is the Mj x 1 
vector and for each node we have two variance components, the marginal prior for f3j given b 
is no longer simple Cauchy, Laplace or horseshoe prior any more, but it will share the similar 
properties to its counterpart in linear parametric model. In this simulation, we employ the 
independent (2(1/2,1/2) prior for each r e j and 7$ only. For the number of knots of the 
LS basis, we may consider each predictor has different number of knots, but it turns out 
a fixed number for all M/s, say Mj = 6, will give good enough results. Therefore, we fix 
Mj = 6,7 = l,...,p in this simulation. Totally 6000 iterations have been employed by the 
single site algorithm and first 2000 ones are discarded for all settings. 

Table 2: Simulation results of sparse additive model (36) for 500 runs 



t 


V 


FP-rate 


FN-rate 


MS 


/iSE 


/ 2 SE 


/ 3 SE 


/ 4 SE 




10 


0.00(0.02) 


0.00(0.03) 


3.99(0.17) 


0.07(0.05) 


0.16(0.06) 


0.18(0.08) 


0.74(0.27) 





80 


0.00(0.01) 


0.00(0.03) 


4.16(0.48) 


0.07(0.05) 


0.15(0.06) 


0.18(0.08) 


0.70(0.27) 


BVGM 


150 


0.00(0.02) 


0.01(0.04) 


4.81(5.52) 


0.08(0.09) 


0.16(0.07) 


0.18(0.11) 


0.73(0.44) 




10 


0.00(0.01) 


0.09(0.10) 


3.56(0.54) 


0.08(0.07) 


0.18(0.09) 


0.16(0.08) 


0.79(0.40) 


1 


80 


0.00(0.01) 


0.09(0.11) 


3.68(0.66) 


0.09(0.08) 


0.18(0.08) 


0.16(0.07) 


0.77(0.39) 




150 


0.01(0.03) 


0.11(0.11) 


4.48(7.17) 


0.10(0.10) 


0.18(0.09) 


0.18(0.10) 


0.80(0.40) 





10 


0.00(0.01) 


0.00(0.00) 


4.00(0.06) 


0.07(0.04) 


0.05(0.04) 


0.11(0.06) 


0.32(0.13) 


80 


0.07(0.08) 


0.18(0.07) 


9.85(11.2) 


0.17(0.28) 


0.79(0.11) 


1.55(0.32) 


5.28(0.58) 


cosso 

1 


10 


0.01(0.03) 


0.04(0.08) 


3.86(0.48) 


0.07(0.07) 


0.26(0.10) 


0.14(0.10) 


2.00(1.00) 


80 


0.10(0.09) 


0.19(0.10) 


12.5(14.0) 


0.36(0.40) 


0.37(0.16) 


1.05(0.44) 


4.67(0.54) 
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Figure 10: The profile curves of the selection probability of simulation model (|36j) with 
p = 80, n = 100 for independent setting (t=0) (a), and correlated setting (t=l) (b). 



Figure 10 (a-b) show us how the selection probabilities changes for a range of b with t = 



and t = 1 for simulation setting p = 80. Note that in Figure 10 (a), one true signal is buried 



in the noise till 6=1, while the same signal is always mixed with noise in Figure 10 (b). 



As shown in Figure 10, when b ~ 26 all false signals go to and we achieve the largest gap 
between signals and the noise. 

One feature of our BVGM is the capability to select the important variables as well 
as estimate the selected function components at the same time. The four true functions 
and one noise, the corresponding estimated functions, and the selection probability for all 
nodes at b = 26 of a simulation run with p = 80 for t = and t = 1 are shown in Figure 



lTj and Figure [12] respectively. Based on the LS basis, the function components estimated 
are always centered, so we also centered the true functions. In this simulation run, the 
four true nodes are selected exactly, and the estimated functions of them are calculated by 
fj = ZjE(j3Ajj = 1), where the expectation is based on the iV = 4000 iterations, and 



51 



o!o 1 o!4 1 o!s 1 

x 2 



P o- 




o:o 



— i — — i — ttz — r 



OA 

x 20 



0!8 



o:o 



— i — rr; — i — ztz — r 



0!4 

X 50 



0!8 



O 

o- 



o 
in" 



o 
o 



— *» 



0:2 



tt; — 1 — ztz — r 



0:6 
X11 



1:0 




25 45 65 85 
Node Index 



Figure 11: True function f/s (blue dashed lines) and estimated function /j's (blue solid 
lines) with 95% credible interval (red dashed lines) for the 4 true nodes (a-d) and a noise 



node (e) of a run of simulation model (36) with independent setting t = and p = 80. The 
marginal selection probability at b = 26 (f). Note we reordered the first 4 true nodes number 
to (2, 20, 50, 70) for a better view. 



the 95% credible intervals are plotted as well. As shown in Figure 11 and 12, for both 



t — and t — 1 the estimated functions are very close to the true functions. Note for the 
noise function, f u , the selection probability is close to zero, thus the estimated function is 
calculated by fj = ZjE(J3j), a expectation over all iV iterations. This is why we see a very 
wide credible interval for / n because when jj = the posterior of /3a is multiple normal 
with large variance. Also note that to have a better view of the selection probability, we 
reordered the nodes such that the 4 true nodes are (2, 20, 50, 70). 

To further examine the performance of variable selection and estimation accuracy of 
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lines) with 95% credible interval (red dashed lines) for the 4 true nodes (a-d) and a noise 



node (e) of a run of simulation model (36) with independent setting t = 1 and p = 80. The 
marginal selection probability at b = 26 (f). Note we reordered the first 4 true nodes number 
to (2, 20, 50, 70) for a better view. 



our method, 500 simulation runs have been employed for p = 10, 80 and 100 respec- 
tively. We calculated seven statistics: "False Positive Rate (FP-rate)", "False Nega- 
tive Rate (FN-rate)", "Model Size (MS)", and "Squared Error (SE)" of 4 true functions, 

where FP rate ^False Positive+#True Negative ' rate j^False Negative+#True Positive ' 

SE = J^iifjA ~ fj,i) 2 / n - The estimated function is calculated by fj = ZjEffiAjj = 1), j G 
{true nodes}. Since it can happen that p(jj = l|y) = for any true function components, 
we simply estimate fj by fj = for the 4 true nodes if p(jj = l|y) = in each run. Statis- 
tics SE can be used to assess the accuracy of the estimation of the nonlinear function fj 
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because the smaller the SE the closer the estimation fj to true function fj. The average and 
standard deviation of those statistics over 500 runs are reported in Table [2] and compared 



with Component Selection and Smoothing Operator (COSSO) ( Lin and Zhang| 2006). 

As shown in Table [2j the results for our method is pretty robust to p. For each t, all 
statistics are similar for different p except a little increase in the mean and standardized 
deviation of those statistics. For different t, our method is also pretty robust, except the 
increase in the values of FN-rate and SE's. On the other hand, we can see COSSO only 
performs well for small p. When p = 80 (COSSO can not work for n < p case, so no 
result for p = 150), all the statistics of COSSO increase, especially for those four true 
function components, SE's are very large meaning COSSO can not estimate those function 
components correctly. In general, we can see our method works very well for BSAM even 
for large p and large correlation cases in both variable selection and function component 
estimation. 

8.5 Case Four: Linear Chain Prior 



Again, we consider the same form of model (35) but with setting p = 100, n = 100, (3j = 0.4 



if j is odd, (3j = 0.8 if j is even, and S = {1, 2, 15}. This example is special in sense 
of its true predictor set S and false signal set S both are continuous in their node index. 
Obviously, the simplest prior network information is a linear chain: any node's two neighbors 
are most likely to aligned to this node. Although this is not true for neighbored node 15 and 
16, but this discontinuity has small effect on the whole system. With this knowledge, we 
would consider the linear chain prior for the nodes. W = {wXij}, \j = 1 for \i — j\ < 1 and 
Xij = otherwise. In order to have a exchangeable prior, two boundary nodes 1 and p can 
be treated as neighbors, i.e., Ay = 1 if \i — j\ = p— 1. To fully use this prior information, we 
employ the cluster algorithm and using this adjacency matrix A = A^ to form the cluster. 
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We also compare the results with noninformative prior (employed by single site algorithm). 




b) With Noninformative Prior 




c) With Linear Chain Prior 




Figure 13: The graph of a linear chain prior with 20 nodes with 1 through 10 nodes "in" (a). 
The profile curves of the selection probability of case four model calculated by the cluster 
algorithm with noninformative prior (b), and with the linear chain prior for 7 (c). 



Figure 12 (a) shows a example graph of a linear chain with p = 20 nodes, note how 



the two end nodes connected. Figure 13 (b-c) are the selection probability profile plot with 
noninformative prior and linear chain prior. For each b in both plots, total 6000 iterations 
have been employed with first 2000 burn-in. For the linear chain prior we take v = $(log(6)) 
where $ is the standard normal CDF such that the interaction strength vanishes for large 
shrinkage and maintains at 1 for small shrinkage. The difference of two plots is obviously: 
with the noninformative prior, two true signals are very close to the noise for all range of b 
and hard to be separated from the false signals, while with the linear chain prior, we can see 
for a large range of b all the true signals are well distinguishable from the noise. 



9 Real Data Analysis 



9.1 Ozone Data 



As an illustration of BSAM implemented by BVGM, we consider an example, the ozone 



data analyzed by Lin and Zhang (2006). The ozone data is available in R package cosso 
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or gss. In the Ozone data, the interesting response variable is the daily maximum one- 
houraverage ozone concentration and eight meteorological variables were recorded in the Los 
Angeles area for 330 days in 1976. The sample size n = 330, and the 8 variables are Height 
(Hgt), Wind Speed (WS), Humidity (Hum), Temperature (Temp), Inversion Base Height 
(InvHt) , Pressure (Press), Inversion Base Temperature (InvTp), and Visibility (Vis). All 
predictors were standardized and the response was transformed using logarithm to have 



normal distributed response. We applied the Bayesian graph model described in Section 5.4 



with Mj = 6 for all predictors. Total 20000 iterations have been employed with single site 
algorithm and half of them were discarded as burn in. By quickly examining a series of b, we 
see the selection probability profile curves are all well defined due to small variable number 
(not shown). So it is more appropriate to choose a modest shrinkage, which is b = 1.6 in this 
case such that all selected predictors reach their highest selection probability. At b = 1.6, 
two predictors have selection probability less than 0.5. 



The estimated results for /3^'s are summarized in Figure 14, where the additive function 
components, fj = ZjE(f3j\^j = l)'s, are plotted with 95% credible interval. Because the 
smallest P = p(7j|y) is at least 0.14, we have enough iterations for all jj = 1 to estimate all 
fj. The marginal selection probability, P, for each variable is labeled in each plot. We can 
then identify three groups of the variables. The first group has ate least P = 0.88 including 
Temp, Press, InvHt and Vis. The second group includes InvTp and Hgt with P = 0.69 and 
0.58. The last group contains Hum and WS with P smaller than 0.5. In variable selection 
point of view, we will select all variables in the first group surely, and we will not select 
the last group since their selection probabilities are very close to the baseline. Because of 
the small variable number and nearly independence of each variable, we can consider the 
second group as true variables with small signals. In the point view of function components 
estimation, the selection probability is consistent to it signal estimation. As shown in Figure 
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InvTemp (F) Height (meters) Humidity (%) WindSpeed (mph) 



Figure 14: Estimated function fj (blue solid lines) with 95% credible interval (red dashed 
lines) for the 8 predictors of ozone data labeled by the marginal selection probability P = 
p(7j = l|y) at b = 1.6. 



16, for the first group of variables the credible intervals only cover a small part of the zero 



line, while for the third group of variables, the zero line is almost in the center of the credible 
interval. Although the credible intervals of the second group of variables cover the zero line 
totally, the zero line is close to the edges of the credible interval. 

We also report the summary statistics for all variance components (their inverses) and 



the intercept in Table |3j In this example we include the intercept term /j, in model (36) and 
assign a prior for \i: [ji] ~ iV^r" 1 ), and a Gamma prior for r M : [t m ] ~ G(4,2). The full 
conditional distributions for /x and r M are easy to derive (not shown here). Note how the 
posterior means of these parameters adapt to the data. Especially for r e /s and r^'s, all start 
with the same prior, but the posterior means are different. For the first group variables, their 
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posterior means for r e j's and r^-'s are obviously different from their priors. 

Table 3: Parameter estimation of ozone data under Bayesian sparse additive model at b = 1.6. 



Parameter 




Prior 






Posterior 




Mean 


Std.Dev. 


Aloc.ui 


Std.Dev. 


Median 


Lower 2.5% 


Upper 2.5% 


Temp 


1.000 


1.414 


0.671 


1.192 


0.150 


0.003 


4.199 


Press 
' e 


1.000 


1.414 


0.787 


1.270 


0.282 


0.008 


4.570 


1 e 


1.000 


1.414 


0.198 


0.597 


0.021 


0.000 


1.700 


T-Vis 
1 e 


1.000 


1.414 


1.316 


1.517 


0.805 


0.021 


5.485 


InvTp 
7~e 


1.000 


1.414 


1.010 


1.365 


0.500 


0.010 


4.884 


Hgt 
1 e 


1.000 


1.414 


1.062 


1.392 


0.547 


0.005 


5.032 


Hum 
' e 


1.000 


1.414 


1.050 


1.431 


0.509 


0.001 


5.129 


' e 


1.000 


1.414 


1.104 


1.505 


0.545 


0.001 


5.334 


Temp 
T d 


1.000 


1.414 


0.226 


0.724 


0.026 


0.002 


2.212 


-.Press 
' d 


1.000 


1.414 


0.033 


0.047 


0.021 


0.002 


0.125 


d 


1.000 


1.414 


0.889 


1.352 


0.349 


0.004 


4.731 


T d 


1.000 


1.414 


0.751 


1.235 


0.227 


0.004 


4.226 


InvTp 
T d 


1.000 


1.414 


1.122 


1.413 


0.611 


0.005 


5.227 


Hgt 
T d 


1.000 


1.414 


1.045 


1.429 


0.521 


0.009 


5.140 


Hum 
T d 


1.000 


1.414 


1.061 


1.474 


0.513 


0.001 


5.048 


-WS 

T d 


1.000 


1.414 


0.949 


1.383 


0.404 


0.000 


4.985 


<t> 






6.602 


0.542 


6.585 


5.599 


7.681 




0.000 




2.143 


0.066 


2.145 


2.011 


2.265 




2.000 


1.000 


1.043 


0.495 


0.971 


0.316 


2.205 



9.2 Gene Selection in Pathway Data 



Mootha et al. (2003) presented an pathway based analysis to test a priori defined path- 



ways for association with the diabetes disease. A pathway is a predefined set of genes that 
serve a particular cellular or physiological function. Therefore a genetic pathway can be 



expressed by a graph to prrsent the gene network within this pathway. Mootha et al. (2003) 



identified several significant pathways among which "Oxidative phosphorylation" , "Alanine- 
and-aspartate metabolism" et al. are interesting ones. However, even with those significant 
pathways identified, gene selection in microarray data analysis is still difficult because al- 
terations in gene expression are modest due to the large number of genes, small sample 



sizes and variability between subjects. Stingo et al. (2011) provide a Bayesian technique to 
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incorporate biological information into linear models to select genes and pathways. Similar 
to 



Stingo et al. (2011), we also incorporate the pathway network information into our graph 



model, and apply it to gene selection of the diabetes data from Mootha et al. (2003). How 



ever, in our method, we use the gene network information in the pathway as the prior for 7, 
and we don't select pathways. The data contains gene expressions from n = 35 subjects, 17 
normal and 18 Type II diabetes patients. We merged three interesting pathways, "Oxidative 
phosphorylation", "Alanine-and-aspartate metabolism" and "Glutamate-metabolism" into 
one graph with total p = 173 nodes (some nodes are different probe sets of the same gene, 
so the gene names are identical) which is a subgraph of the corresponding merged graph 
obtained from KEGG database. The response y is the continuous glucose level. 
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Figure 15: Profile curves of the selection probability of genetic pathway data with noninfor- 



mative prior for 7 (a), and with informative prior as (38) (b). 



The top left plot of Figure [16] shows the network of our merged gene set. Note, the prior 
required for our graph model is undirected graph with positive interaction only. We can see, 
most of the nodes are independent in this data set, and there are only three genetic clusters. 
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Because of this, if we apply the cluster algorithm and use the adjacency matrix A based 



this network information into expression (31), we will end up with a few nodes in the same 



genetic cluster that can form the clusters for the algorithm. Therefore, we consider following 
interaction matrix W = {wijXij} for the prior of 7 with adjacency matrix A = {A^} as 

Xij = 1, i,j = l,...,p 

w, i g S or j £ S ( 38 ) 

w + Aw, % e S and j G S, 
where S represents one of the three genetically networked gene clusters in the pathway 

network, w and Aw are small positive numbers stand for the strength of the interaction in 



Wij 



the prior and the difference of two types of interaction. If Aw = 0, we can consider (38) 
as a baseline graph prior for 7, which is a complete graph with positive fixed interaction. 
Since we also vary b to have an overall view about the selection probability, it is necessary 
to have w — » when b — > since with large shrinkage J^- — > and we don't want w^ 
dominates the graph interaction. One convenient way is to express w as w = wo$[log(6)] 
which approaches as b — > and reaches the maximum wo for large b, where $(•) is the 
CDF of standard normal. Note that the choice of w is involved in the consideration of so 



called phase transition (Li and Zhang, 2010). If w is too large all the nodes will always be 
connected which leads to either all nodes are selected or none are selected. Now we consider 
Aw 7^ 0, say Aw = 5w, so we incorporate the genetic network information into the graph 
prior. Aw can not be too large, otherwise those genes in the genetic clusters will always 
be aligned which means in this data set they will all have small selection probability. So 
we choose the Wq ~ 0.01 as small as possible to avoid the phase transition phenomena, but 
it is must be large enough to reduce so called region II of b caused by small sample size. 
With this selection, we have the prior interaction w^ ~ 0.01 for two nodes not in the genetic 
cluster together for large b, and ~ 0.06 for two nodes in the cluster for large b. 
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As shown in Figure 15, the effect of incorporating prior information for the graph model 
is obvious. We run the cluster algorithm for total N = 40000 iterations, and discarded the 
first 10000 as burn-in. So the selection probability is calculated by taking the mean of 7 over 



30000 iterations. In Figure 15 (a), with noninformative prior for 7, we can see even though 
we are still able to identify several genes behaving differently from the rest (highlighted by 
solid lines in the plot), for the moderate value of b all the curves are mixed. On the other 



hand, in Figure 15 (b), with informative prior for 7 defined as (38) the profile curves are 
much "cleaner" even for moderate b. Around b = 10 we can see a bunch of curves are clearly 
distinguishable from the rest. We highlighted 6 nodes with highest selection probability 



around b = 10 in Figure 15 (b). 



To examine more details of the results, in Figure 16 we fixed b = 8.5 and run the 
cluster algorithm for N = 60000 iterations with first 20000 discarded. With this shrinkage 
parameter, the prior interaction parameter Wij ~ 0.06 for i and j in the genetic cluster, 
and ~ 0.01 otherwise. The selection probability for all nodes are shown in bottom 



left of Figure [16] where we take a cut-off probability as 0.2 and identify 6 nodes that have 
relative high selection probabilities. Among those nodes, UQCRB has the largest selection 
probability for all range of b, so it is easy to identify UQCRB as the most significant gene. 
We also select other five genes, COX8, ATP5G2 (two probe sets), ATP5H and CRAT at 
b = 8.5. All the genes selected except CRAT are from "Oxidative phosphorylation" pathway 
which is related to ATP synthesis. It is well known ATP plays a importance role in Type 
II diabetes disease. CRAT is from "Alanine-and-aspartate metabolism" pathway. Both 
"Oxidative phosphorylation" and "Alanine-and-aspartate metabolism" pathway are two top 



significant pathways identified using random forrest tree approach (Pang et al. 2006). 

Since our cluster algorithm forms the Wolff cluster at each iteration, a byproduct of the 
MCMC sampler is the frequency of two nodes being aligned or anti-aligned when they are 
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in the cluster. The top right plot in Figure 16 is the heatmap matrix of the frequency of 
two nodes being aligned in the cluster out of 40000 iteration, and bottom left plot is the 
frequency of two nodes being anti-aligned in the cluster. The color bar of two plots shows 
the scale of the frequency, the darker the color the lower the frequency. In the top right plot, 
the dark colored lines are those genes have lower chance to be aligned to the others when 
they form the cluster, and in the bottom left plot, the bright colored lines are the same genes 
but with high chance to be anti-aligned to others if they form the cluster. Note those lines 
are consistent to the genes with high selection probability in the bottom right plot. This is 
because most of the genes have low selection probability around 0.05 then those genes with 
higher selection probabilities should have lower (higher) chance to be (anti-) aligned with 
them. For individual node, we define it is always self-aligned, so the diagonal in top right 
plot is 1, meanwhile an individual node is never anti-aligned to itself, so diagonal in bottom 
left plot has value 0. The distinguishable color of those genes in two heatmaps show that we 
can also use the cluster information to distinguish genes. 

So far, we identify 6 genes (probe sets) with cut-off probability 0.2, we may decrease the 
cut-off to select more genes. However, the selection probabilities are low for most of the 
genes except for UQCRB at fixed b. This is because of the problem of modest alterations 
for single gene selection, or it simply means the signals are weak. Here we selected those 
genes not only depending on the selection probability at fixed b, in stead we select them by 



examining their overall profile as shown in Figure [15j We also demonstrated that the graph 
model variable selection can easily adopt the prior graph information, thus we can consider 



similar approach as Stingo et al. (2011) to select networked pathways, which may result in 
higher selection probability for pathways at the optimal shrinkage parameter b. 
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Figure 16: Summary of the results for the genetic pathway data. Top left: genetic network 
structure of the data. Top right: the frequency matrix of two nodes aligned in the cluster 
over total iterations. Bottom left: the frequency matrix of two nodes anti-aligned in the 
cluster over total iterations. Bottom right: Selection probability with cluster algorithm at 
b = 8.5 with informative prior (38). 
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10 Discussion 



The goal of this paper is to present BVGM from two major aspects. The first is how to sample 
the "in" or "out" binary random variable. We pointed out that Bayesian variable selection 
can be considered as the binary random process on a complete graph given noninformative 
prior for 7/s, and we compared the single site and generalized Wolff cluster updating algo- 
rithm. Another one is how to construct the interaction matrix of the complete graph, which 
is implemented by sampling the linear model coefficient through the scale mixtures of 
normal priors. We also discussed the marginal selection probability profile under different 
shrinkage parameter and compared three prior settings for (3j which represent three typical 
situations of shrinkage proportion. Our BVGM method possesses the advantages of simple 
form, easy implementation and straightforward to extension. For example, the BVGM is 
very easy to extend to Bayesian sparse model by representing the nonparametric function 
components fj as linear combination of the basis matrix fj = Zj/3j, then we can employ 
the group selection of vector Another example is to incorporate network information 
for 7. Although this paper does not focus on how to construct the prior network structure 
information, the simulation and real data analysis show that it is easy to incorporating the 
prior graph information and improve the performance of BVGM. This paper also systemati- 
cally studies the behaviors of the marginal selection probability against the shrinkage. Both 
theoretical and simulated results show that to have the largest gap between the signals and 
the noise it is critical for the scale mixture normal prior to maintain substantial proportion 
on small shrinkage. 

However, this paper only starts a different view angle about Bayesian variable selection, 
further research includes but are not limited to following questions. 

I. As shown in Theorem 3 we have E(/3j\y, 7^ = 1) = ^- log 71^ for orthogonal design. 
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This equation reveals the relationship between the selection probability and the signal 
magnitude: the larger the signal, the further the selection probability profile being 
separated from the baseline. However, this relationship does not provides a cut-off rule 
to separate the signals from the noise. As we can see in the paper, at different shrink- 
age parameter b, the selection probabilities are different. Thus at different shrinkage 
parameter, the cut-off line should be different too, and we simply choose b where there 
is a largest gap between two set of signals. However, p(jj\y) > 0.5 does not mean 
the corresponding predictor should be selected, such as for b — > many noise predic- 
tors have selection probability no less than 0.5; and p(7j|y) — > does not guarantee 
the corresponding predictor should be removed, since all predictors have zero selection 
probability for very large b with orthogonal design. Further research should show the 
consistency of selecting the predictors based on the "largest gap" rule, and provides 
more straightforward method to choose b. 

2. Limitation of fixing b. The global shrinkage or temperature parameter b is fixed, 
which limits the performance of our method since it limits the range of local shrinkage 
parameter. Assigning prior for b is not appropriate too due to the high dimensionality 
such that the posterior distribution of b is forced to be very small. To automatically 
have b large and small at the same time, we can adopt a remedy similar to exchange 
Monte Carlo by running parallel MCMC's at two or more 6's with some b small and 
the other large, and exchanging their configuration according to certain probability 
satisfying the detailed balance. This remedy may improve the performance of results. 

3. Bayesian sparse additive model with interaction. Our BSAM does not include the 
interaction terms, but it should be easy to extend to include them. The only problem 
is figure out how to represent the interaction function components. This can be done 
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under the spline ANOVA models similar to Reich et al. (2009). 



4. Prior graph information. With a known networked graph prior, we have better per- 
formance in some cases because the prior reduces the searching space for 7. However, 
there is no way to have exact knowledge about the network prior information for the 
predictors, and it is difficult to construct a meaningful network as the prior. So this 



keeps a open question as discussed by Li and Zhang (2010) and Monni and Li (2010). 



Appendix A 

A.l Proof of Theorem 1 

In general, in the Markov chain of MH algorithm, the move from current state 7° to the 
proposed state 7* in the cluster has the transition probability, P{"f° c — > 7*), which satisfies 
the detailed balance condition 

-> 72 - p(nTc\y,M r m ~ _ u( 0)] } (A11) 
PiYc^'rt) pW\y,M P| [ (7c) (7c)JI ' ( } 

The transition probability can be broken down into two parts: 

where g(-) is the selection probability, which is the probability given 7° that the new target 
state generated, and A(-) is the acceptance ratio. Thus 

Now we consider the move 7° — > 7*, starting with a particular cluster c and then adding the 
others to it in a particular order. Consider also the reverse move, which takes us back to 7° 
from 7*, starting with exactly the same cluster (except the state in the cluster is flipped), and 
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adding the others to it in exactly the same way as in the forward move. The probability of 
choosing the cluster (if the cluster is the seed node) is exactly the same in the two directions, 
as is the probability of adding each node to the cluster. The only difference between the two 
directions is the probability of "breaking" bonds around the edge of the cluster. Because 
the cluster couples with all j G c, for both directions, there are |c| bonds which have to be 
broken in order to flip the cluster. These broken bonds represent the affinity between the 
cluster and the spins which were not added to the cluster by the algorithm. We represent 
the probability of not adding such a node in forward move as l—p®j,j G c and in backward 
move as 1 — p* -, j G c. Thus the probability of not adding all of them, which is proportional 
to the selection probability g("f° c — > 7*) for the forward move, is fX/ec^ ~Paj)- m the reverse 
move then the probability of doing it is fljec(l — Paj)- The condition of detailed balance, 
Equation (A. 1.1), along with Equation ( |A.1.2 ), then tells us that 



(A.1.3) 

Note that the energy change f/(7*) — U(~f° c ) is only determined by the bonds (the coupling 
between c and c) and coupling of c with the external field h*, i.e., 

urn - u{<® = 5>iy» (j2 J ^ - E + E h i - E h l (a.ia) 

j£c VfcGco 26ci / jeci jr'ec 



The first part of right hand side of Equation (A.1.4) can be decomposed as 



A E(- X ) 7j f E J * - E J t) + c 1 - A ) E(-!) 7j f E J * - E A 



With the probability of adding a node j G c to the cluster, p 0)J -, defined as (15), 



exp -A(-ip X) J i*-E J i' 
I Vfceco zeci 
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Substituting above equation into Expression (A. 1.3) and rearranging, we derive the accep- 



tance ratio for the moves in the two directions as 



exp 



(i-A)£(-i) 7, (E J i*-£ J i' 



and the acceptance probability for move from 7° to 7* is 



jGc 



j'eci 



As well as satisfying the detailed balance, the algorithm also guarantees the ergodicity by 
the fact that there is always a finite chance that any spin will be chosen as the sole member 
of cluster of one, which is then flipped. The appropriate succession of such moves will get 
us from any state to any other in a finite time as ergodicity requires. 



A. 2 Proof of Theorem 2 

The proof of the first part in Theorem 2 is simply algebra calculation. First define y* 
y — J2k^j 7fc x fc/3fc, and integrate out (3j and f3j separately in following expression 



p(7|y,T,6)cx J p(y\j,(3)p((3\T,b)d/3 
oc / exp 



1 2 

~2 (y* -mPj) 



oc exp 



6 2 



p(/3j\Tj, b)d(3j Y\_P(Pk\T k , b)d(3 k 
,2 \ 1/2 

' £(7?' 7j; ^"j)) 



7i + Til* 



(A.2.1) 
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2 \ % + 6 2 



-i 



where £(7j, Kj,~fj,Tj) is calculated by integrating out fly. 
s(",-'',-7,-T. ; ) oc y exp 

y y ^ Y\.P(Pk\r k ,b)dp k 

[0f(c^ ] )fl.-2a j ^fl d 
P(Pj\ T 3> b ) d Pj 



x exp 



oc J exp 
x exp 
oc exp 



k+3 



2 V 7j + 6 2 



1 / N 2 



1 «J J )ajC 7 , 



i%r /2 pii i/2 , 



(A.2.2) 



where O^- = + X^_X^~. — (1 — Kj) 73 (c 7 ,c 7 ,)] 1 and is easy to show it is positive definite. 
We also used the identity jj/ (jj + p-) = (1 — kJ j ). Note if jj — 0, £(• • • ) does not depend 



on or Tj. 



Then by definition and above expressions, 



/ E 7J P(lj = h 7j|y, t, b)p(r)d7 



7T; 



/ E TJ £(7j = 1, «i, 7j, T 3 )p{T 3 )di 
I E T5 £(7j = 0, «j, 7j, T 3 )p(T 3 )di 



p( T j)dTj 



(A.2.3) 



7Tj£jP(Hj)dKj 



with defined as (20). It is easy to show that £j = 1 for orthogonal design since c 7j = 
for all j. 

The proof of the second part for Hj is trivial. Obviously, ttj — > 1 as Kj — > 1 and 7Tj — >■ as 
— >■ 0. For 7r^, it is more convenient using 7Tj = j TTjC,jp(Tj)dTj, where irj and £j are measur- 
able functions indexed by b. Both ttj and £j are bounded by some positive number for all b. 
When b — > 0, lim^o 7Tj — >■ 1 and lim b _> £j — >" 1; thus according to Lebesgue's Dominated Con- 
vergence Theorem (DCT), lirn&_ >0 'fij = lim^o f ^jCjPi T j)dTj = f lim fe ^ ( 7r i^j)p( r j)^ r j — 1- 
When b — > oo, lim^oo 7Tj — > and lim^oo ^ equal to some finite number. Again the limit 
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and integral commute by DCT, thus we have lirn&_>.oo tt? = J lim6_ ) . 0O (7rj ;£j)p(Tj)drj = 0. 



A. 3 Proof of Theorem 3 

The existence of rrij indicates the marginal prior p(flj) = f v{Pj\ T j)vijj)dTj is bounded for 
/3j G R, which is true for Cauchy and Laplace prior. Using identity 

rp 8 



{Pi - a j)pWj,ij = K 



p(y|^i>7i = !)> 



so that 



dy 



m A E (l3j\y,ij = i) - flj] = / - %)p(y|/3j,Ti = l )p{Po)dPj 



(A.3.1) 



<9 



x ^ logm ^- 



Following the lemma given in Pericchi and Smith (1992), the interchange of the derivative 



and the integral is justified. The second result of Theorem 3 is straightforward by observing 
oc exp (— \y T y) n*, thus 



m 







-a, + 



d log TTj 



X"^ 771 ■ = X"^ I — y ~\~ 

3 dy 3 3 \ da,j dy ) " J da 



For horseshoe prior, p(/3j) is not bounded. However, using the technique introduced in 



Carvalho and Poison 



(2010) by defining m* = J p(y|/3j,7j = l)p((3j\Tj)p(rj)r- 1 df3jdTj, it can 



be shown 

E(/3 J |y, 7 , = l) = -^xj|-logm*, 
and similar arguments then follow. For horseshoe prior, it also can be shown 



3 •* 3 "* ^ / 

where 7rj = J TXjTj x -p{jj)dTy 
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A. 4 Proof of Theorem 4 



It is more convenient to use following equivalent representation of 7rJ 



7T; 



exp 



[I + Ti 



[rj/il+r^pir^dr. 



where p{rj\b) is corresponding prior of Tj given b such that 7T* = J p(j3j\b)p(Tj)d,Tj = 
J P{Pj)p{ T j\b)d T j- Then the similar condition for p(Tj\b) can be derived from the condition 
for p(tj) in Theorem 4, i.e., 

A 



p((Jj\b) ~ (cr|) Q 1 exp ( -y^tf ) L b ((Tj)da'j, as cr^ -> oo, 



where L b is the slowly varying function conditioning on parameter b. 
Then the marginal odds ir b can be expressed as 



(A.4.1) 



ir b oc exp ( ^ j m b (aj) = exp (^^j j u j 1 ex P 



t)p(u*\b)<kJ*, 



where w| = 1 + erj, and the integral, m b (aj), is a scale mixture of normals. Now the proof is 
similar to 



Poison and Scott 



(2011). If prior p(a^ \b) satisfies the conditions defined in (A.4.1) 



so does p(uj\b) satisfy similar conditions, i.e., 



as Uj — >■ oo 



Then following Theorem 6.1 of Barndorff-Nielsen et al. (1982), as aj —> oo, m b (aj) can be 
approximated as 



|2 - 1 L 6 (oj) 



m b (aj) ~ 



if A = 



|a-l 



2A I 



" exp \—\/^\(ij\)L b (\aj\) if A > 0, 



(A.4.2) 



as a,- — > oo. The results in Theorem 4 follow by taking derivative respect to \aA and b 



respectively. 
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A. 5 The Calculation of ir] 



For orthogonal designs, in with Laplace prior can be integrated out directly from (21) 



71"; 







J k 1 - exp 





1 dK3 



2tt\ 
2b 2 \ TJ 



1/2 



exp 




-3/2 



k. exp 



(A.5.1; 



where A = 1/b 2 and [i = w l/(6 2 a|), and the expression in the integral is the CDF of inverse 
Gaussian distribution. Borrowing the expression of the CDF of the inverse Gaussian, we 



then integrated out the integral to get expression (26). 

7Tj with horseshoe prior for orthogonal design can also be derived directly from (21) 



7T- 



f 1 i 




J k- exp 


"|d-%)" 



-k7 1/2 (1 - «;)~ 1/2 (1 - «< + b 2 K,j)~ x dK 3 
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^6 eXP V2 



5^(1 - «,.) 1/2 - 1 [(1 - 6" 2 )«i + b~T exp ( ) ^ 



a? 



(A.5.2) 



where the expression in the integral is the transformation of the hypergeometric inverted- 
beta distribution which was shown to be represented by degenerate hypergeometric functions 



(Gordy, 1998; Poison and Scott, 2010), thus we can follow Poison and Scott (2010) to express 



it as (27). 



A. 6 Lancaster and Salkauskas Basis for Natural Cubic Spline 



In this paper, we follow Chib and Greenberg (2010) to employ the cubic spline LS basis 



described by Lancaster and Salkauskas (1986). Consider the jth function fj(x), and let 

l,...,Kj quantile of x^,i 



v 



[v\j, VKjj) be the set of 100 x ^Mj %, &: 



1 n. 



Thus v\j = minj(xjj) and Uk j = maxj(xjj). Kj denotes the number of knots for the spline 
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functions. Then the cubic spline expansion of fj(x) is expressed as 



fj(Xij) = ^ >!*/,; (•'',., )///,/ + ^kj(Xij)s k j] 



k=l 



(A.6.1) 



where 



(pij, gKjj) T and Sj = (sy, s# ,j) T are the coefficients of this expression, 



Qjixij) = [$i j (x ij ),...$ Kjj (x ij )] T fmdtyj* 



x 



■■■^ Kjj{.Xij)] T are two basis vectors, 



and the basis functions {$kj( x )}k=i an d {^kj( x )} k =i are defined as 



K, 



$kj(x) oc < 



(A.6.2) 



$kj(x) oc < 



0, x < is k _ ld 

-(2/h 3 kj )(x - v k -i,j) 2 {x - u k j - 0.5h kj ), v k - hj <x< u kj 

( 2 / h l+i,j)( x ~ Vk+i,j) 2 {x - Vkj + 0.5/ifc+ij), v kj <x< v k+1;j 

x > v k +ij, 

0, x < v k - X j 

(l/h 2 kj )(x - z/ fc _i,j) 2 (x - v k j), v k -i,j <x< u kj 

{ l l h l+x,j){ x - Vk+i,j) 2 {x - v kj ), Vkj <X< Z/ fc+ ij 

x > v k+ ij, 

where h k j = v k j — v k -\,j- Note that ^>\j and &Kjj, ^ ' Kjj are defined by last two lines and 
first two lines of above expressions respectively, gj and Sj are interpreted the ordinate and 
slope of fj(x). Since fj(x) is a natural cubic splines with the second derivative equal to zero 
at two end points, and continuous derivative at knot points, both gj and Sj are constrained 



(Lancaster and Salkauskas, 1986) by s,- = A- Cjgj, where 



Aj 
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UJ 2 j 
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2 fi 2j 

fl 2 j 



u 2j 2 
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^Kj-lj 2 flK 3 -l,j 
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( — — -i- 

&2j fJ-2j ill] 

h,2j hij hzj h$j 
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h-K, 







/ 



i< , , h i<ii 

where Ukj = hkj/ {hhj + hk+i,j) and /ijy = 1 — ^kj for k = 2, Kj. With this constraints, Sj 



can be replaced from the function expression (A. 6.1) 



fjixij) = «!•, (.r, ; )' + ^ J (.r, l ) r .\ l Cj\{ 



tj (Xij)gj, 



(A.6.3) 



where tj(xjj) = (ty(xy), tKjj{ x ij)) — &j(xij) T + ^j(x i j) T Aj 1 Cj. Furthermore, consider 
the identifying constraints, J2k9kj = 0, we can express gij = — [gij + • • • + 9Kjj), thus 

fj{Xij) = t' (•'•,.; )g, = [hj{Xij) - I \ l (x, l )jj-2j H h I , ,■,,{. r,j) - l :j (.r, l ) ! n < t = z* T (x ij )^ j , 

where /3j = (g2j, ...,gKjj) T and we define matrix 

V Z f ( X nj) J 

Now the jth nonparametric function expressed by the natural cubic spline basis is fjfej) 



Z*(3j. In order to incorporate the assumption of a priori smoothness, |Chib and Greenberg 



(2010) consider a prior distribution on /3-'s as, 



where N is the Kj — 1 dimensional multivariate normal distribution, and 



(A.6.4) 



/ 



T, 



a 



<--.) 







o%l w , 
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a 



where two variance components and cr^- are selected here because of the different normal 
assumptions for the differences of the ordinates and the differences of slopes. Aj is given by 
/ 



A? 



V 



to 
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h 2j 


h 2j 


1 




h 2j 
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h 3j 









1 

/l3j 



1 
1 



1 1 






1 

h 2 j 






JL \ 

h 2j 






So far the construction of function fj(x.j) is exactly the same as Chib and Greenberg 



(2010). Note that fj(x.j) = Z*f3j with the prior of /3- given by (A. 6. 4) is equivalent to have 
fj(xj) = Z^AJ 1 ^ with \p 5 \ ~ N(0,Tj). Henceforth, we define the final n x Mj basis matrix 
Zj = ZjAj 1 such that /j(xj) = 2^/3^, where Mj = Kj — 1. Define r e j = cr~ 2 and ry = cr^ 2 , 
and modify the one variance component prior algorithm in Section [7j then we can easily 
employ the LS basis into BSAM. 
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